RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-02 Thread Liu, Yi L
Hi Jason,

> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 7:54 PM
> 
> On Thu, Apr 01, 2021 at 07:04:01AM +0000, Liu, Yi L wrote:
> 
> > After reading your reply in
> > https://lore.kernel.org/linux-iommu/20210331123801.gd1463...@nvidia.com/#t,
> > you seem to mean the /dev/ioasid FD is per-VM instead of per-ioasid, so
> > the above skeleton doesn't suit your idea.
> 
> You can do it one PASID per FD or multiple PASID's per FD. Most likely
> we will have high numbers of PASID's in a qemu process so I assume
> that number of FDs will start to be a constraining factor, thus
> multiplexing is reasonable.
> 
> It doesn't really change anything about the basic flow.
> 
> digging deeply into it either seems like a reasonable choice.
> 
> > +-----------------------------+-----------------------------------------------+
> > |  userspace                  |   kernel space                                |
> > +-----------------------------+-----------------------------------------------+
> > | ioasid_fd =                 | /dev/ioasid does below:                       |
> > | open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {                      |
> > |                             |       struct list_head ioasid_list;           |
> > |                             |       ...                                     |
> > |                             |   } ifd_ctx; // ifd_ctx is per ioasid_fd      |
> 
> Sure, possibly an xarray not a list
> 
> > +-----------------------------+-----------------------------------------------+
> > | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> > |       ALLOC, &ioasid);      |   struct ioasid_data {                        |
> > |                             |       ioasid_t ioasid;                        |
> > |                             |       struct list_head device_list;           |
> > |                             |       struct list_head next;                  |
> > |                             |       ...                                     |
> > |                             |   } id_data; // id_data is per ioasid         |
> > |                             |                                               |
> > |                             |   list_add(&id_data.next,                     |
> > |                             |            &ifd_ctx.ioasid_list);             |
> 
> Yes, this should have a kref in it too
> 
> > +-----------------------------+-----------------------------------------------+
> > | ioctl(device_fd,            | VFIO does below:                              |
> > |       DEVICE_ALLOW_IOASID,  | 1) get ioasid_fd, check if ioasid_fd is valid |
> > |       ioasid_fd,            | 2) check if ioasid is allocated from ioasid_fd|
> > |       ioasid);              | 3) register device/domain info to /dev/ioasid |
> > |                             |    tracked in id_data.device_list             |
> > |                             | 4) record the ioasid in VFIO's per-device     |
> > |                             |    ioasid list for future security check      |
> 
> You would provide a function that does steps 1 & 2; look at eventfd for
> instance.
> 
> I'm not sure we need to register the device with the ioasid. The device
> should incr the kref on the ioasid_data at this point.
> 
> > +-----------------------------+-----------------------------------------------+
> > | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> > |       BIND_PGTBL,           | 1) find ioasid's id_data                      |
> > |       pgtbl_data,           | 2) loop the id_data.device_list and tell iommu|
> > |       ioasid);              |    give ioasid access to the devices          |
> 
> This seems backwards, DEVICE_ALLOW_IOASID should tell the iommu to
> give the ioasid to the device.
> 
> Here the ioctl should be about assigning a memory map from the
> current mm_struct to the pasid
> 
> > +-----------------------------+-----------------------------------------------+
> > | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> > |       UNBIND_PGTBL,         | 1) find ioasid's id_data                      |
> > |       ioasid);              | 2) loop the id_data.device_list and tell iommu|
> > |  

RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 7:47 PM
[...]
> I'm worried Intel views the only use of PASID in a guest is with
> ENQCMD, but that is not consistent with the industry. We need to see
> normal nested PASID support with assigned PCI VFs.

I'm not quite following here. Intel also allows PASID usage in a guest
without ENQCMD, e.g. passing through a PF to the guest and using PASID on
it without ENQCMD.

[...]

> I'm sure there will be some small differences, and you should clearly
> explain the entire uAPI surface so that someone from AMD and ARM can
> review it.

good suggestion, will do.

> > - these per-ioasid SVA operations are not aligned with the native SVA
> >   usage model. Native SVA bind is per-device.
> 
> Seems like that is an error in native SVA.
> 
> SVA is a particular mode of the PASID's memory mapping table, it has
> nothing to do with a device.

I think it still has a relationship with the device. This is determined by
the DMA remapping hierarchy in hardware, e.g. on Intel VT-d the DMA isolation
is enforced first at device granularity and then at PASID granularity. SVA
makes use of both PASID and device granularity isolation.

Regards,
Yi Liu

> Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 9:43 PM
> 
> On Thu, Apr 01, 2021 at 01:38:46PM +0000, Liu, Yi L wrote:
> > > From: Jean-Philippe Brucker 
> > > Sent: Thursday, April 1, 2021 8:05 PM
> > [...]
> > >
> > > Also wondering about:
> > >
> > > * Querying IOMMU nesting capabilities before binding page tables (which
> > >   page table formats are supported?). We were planning to have a VFIO cap,
> > >   but I'm guessing we need to go back to the sysfs solution?
> >
> > I think it can also be with /dev/ioasid.
> 
> Sure, anything to do with page table formats and setting page tables
> should go through ioasid.
> 
> > > * Invalidation, probably an ioasid_fd ioctl?
> >
> > yeah, if we are doing bind/unbind_pagetable via ioasid_fd, then yes,
> > invalidation should go this way as well. This is why I worried it may
> > fail to meet the requirement from you and Eric.
> 
> Yes, all manipulation of page tables, including removing memory ranges, or
> setting memory ranges to trigger a page fault behavior should go
> through here.
> 
> > > * Page faults, page response. From and to devices, and don't necessarily
> > >   have a PASID. But needed by vdpa as well, so that's also going through
> > >   /dev/ioasid?
> >
> > page faults should still be per-device, but the fault event fd may be stored
> > in /dev/ioasid. page response would be in /dev/ioasid just like 
> > invalidation.
> 
> Here you mean non-SVA page faults that are delegated to userspace to
> handle?

No, just SVA page faults; otherwise there is no need to let userspace handle them.

> 
> Why would that be per-device?
>
> Can you show the flow you imagine?

DMA page faults are delivered to the root complex via a page request message,
and they are per-device according to the PCIe spec. The page request handling
flow is:

1) the iommu driver receives a page request from the device
2) the iommu driver parses the page request message, getting the RID, PASID,
   faulted page, requested permissions, etc.
3) the iommu driver triggers the fault handler registered by the device
   driver with iommu_report_device_fault()
4) the device driver's fault handler signals an event FD to notify userspace
   to fetch the information about the page fault. In the VM case, inject the
   page fault into the VM and let the guest solve it.

Eric has sent the series below for page fault reporting for a VM with a
passthrough device.
https://lore.kernel.org/kvm/20210223210625.604517-5-eric.au...@redhat.com/

Regards,
Yi Liu

> Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 9:16 PM
> 
> On Thu, Apr 01, 2021 at 01:10:48PM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Thursday, April 1, 2021 7:47 PM
> > [...]
> > > I'm worried Intel views the only use of PASID in a guest is with
> > > ENQCMD, but that is not consistent with the industry. We need to see
> > > normal nested PASID support with assigned PCI VFs.
> >
> > I'm not quite following here. Intel also allows PASID usage in a guest
> > without ENQCMD, e.g. passing through a PF to the guest and using PASID
> > on it without ENQCMD.
> 
> Then you need all the parts, the hypervisor calls from the vIOMMU, and
> you can't really use a vPASID.

Here is a diagram showing the vSVA setup.

 .-------------.  .---------------------------.
 |   vIOMMU    |  | Guest process CR3, FL only|
 |             |  '---------------------------'
 .----------------/
 | PASID Entry |--- PASID cache flush --+
 '-------------'                        |
 |             |                        V
 |             |                CR3 in GPA
 '-------------'
Guest
------| Shadow |---------------------------|------
      v        v                           v
Host
 .-------------.  .----------------------.
 |   pIOMMU    |  | Bind FL for GVA-GPA  |
 |             |  '----------------------'
 .----------------/  |
 | PASID Entry |     V (Nested xlate)
 '----------------\.------------------------------.
 |             |   |SL for GPA-HPA, default domain|
 |             |   '------------------------------'
 '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables

https://lore.kernel.org/linux-iommu/20210302203545.436623-1-yi.l@intel.com/

> 
> I'm not sure how Intel intends to resolve all of this.
> 
> > > > - these per-ioasid SVA operations are not aligned with the native SVA
> > > >   usage model. Native SVA bind is per-device.
> > >
> > > Seems like that is an error in native SVA.
> > >
> > > SVA is a particular mode of the PASID's memory mapping table, it has
> > > nothing to do with a device.
> >
> > I think it still has a relationship with the device. This is determined
> > by the DMA remapping hierarchy in hardware, e.g. on Intel VT-d the DMA
> > isolation is enforced first at device granularity and then at PASID
> > granularity. SVA makes use of both PASID and device granularity isolation.
> 
> When the device driver authorizes a PASID the VT-d stuff should set up
> the isolation parameters for the given pci_device and PASID.

Yes, both device and PASID are needed to set up the VT-d stuff.

> Do not leak implementation details like this as uAPI. Authorization
> and memory map are distinct ideas with distinct interfaces. Do not mix
> them.

Got it. Let's focus on the uAPI here and leave implementation details to the
RFC patches.

Thanks,
Yi Liu

> Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
> From: Jean-Philippe Brucker 
> Sent: Thursday, April 1, 2021 8:05 PM
[...]
> 
> Also wondering about:
> 
> * Querying IOMMU nesting capabilities before binding page tables (which
>   page table formats are supported?). We were planning to have a VFIO cap,
>   but I'm guessing we need to go back to the sysfs solution?

I think it can also be with /dev/ioasid.

> 
> * Invalidation, probably an ioasid_fd ioctl?

yeah, if we are doing bind/unbind_pagetable via ioasid_fd, then yes,
invalidation should go this way as well. This is why I worried it may
fail to meet the requirement from you and Eric.

> * Page faults, page response. From and to devices, and don't necessarily
>   have a PASID. But needed by vdpa as well, so that's also going through
>   /dev/ioasid?

Page faults should still be per-device, but the fault event fd may be stored
in /dev/ioasid. Page response would go through /dev/ioasid just like invalidation.

Regards,
Yi Liu

> 
> Thanks,
> Jean


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
Hi Jason,

> From: Liu, Yi L 
> Sent: Thursday, April 1, 2021 12:39 PM
> 
> > From: Jason Gunthorpe 
> > Sent: Wednesday, March 31, 2021 8:41 PM
> >
> > On Wed, Mar 31, 2021 at 07:38:36AM +0000, Liu, Yi L wrote:
> >
> > > The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> > > the VM should be able to be shared by all assigned device for the VM.
> > > But the SVA operations (bind/unbind page table, cache_invalidate)
> > > should be per-device.
> >
> > It is not *per-device* it is *per-ioasid*
> >
> > And as /dev/ioasid is an interface for controlling multiple ioasid's
> > there is no issue to also multiplex the page table manipulation for
> > multiple ioasids as well.
> >
> > What you should do next is sketch out in some RFC the exact ioctls
> > each FD would have and show how the parts I outlined would work and
> > point out any remaining gaps.
> >
> > The device FD is something like the vfio_device FD from VFIO, it has
> > *nothing* to do with PASID beyond having a single ioctl to authorize
> > the device to use the PASID. All control of the PASID is in
> > /dev/ioasid.
> 
> good to see this reply. Your idea is much clearer to me now. If I'm getting
> you correctly, I think the skeleton is something like below:
> 1) userspace opens a /dev/ioasid; meanwhile an ioasid is allocated and a
>    per-ioasid context is created which can be used to do bind page table
>    and cache invalidate; an ioasid FD is returned to userspace.
> 2) userspace passes the ioasid FD to VFIO and lets it be associated with a
>    device FD (like the vfio_device FD).
> 3) userspace binds a page table on the ioasid FD with the page table info.
> 4) userspace unbinds the page table on the ioasid FD.
> 5) userspace de-associates the ioasid FD and device FD.
> 
> Does above suit your outline?
> 
> If yes, I still have below concern and wish to see your opinion.
> - the ioasid FD and device association will happen at runtime instead of
>   just in the setup phase.
> - how about AMD and ARM's vSVA support? Their PASID allocation and page
>   table setup happen within the guest. They only need to bind the guest
>   PASID table to the host. Above model seems unable to fit them. (Jean,
>   Eric, Jacob please feel free to correct me)
> - these per-ioasid SVA operations are not aligned with the native SVA
>   usage model. Native SVA bind is per-device.

After reading your reply in
https://lore.kernel.org/linux-iommu/20210331123801.gd1463...@nvidia.com/#t,
you seem to mean the /dev/ioasid FD is per-VM instead of per-ioasid, so the
above skeleton doesn't suit your idea. I draft the skeleton below to see if
our minds are the same. But I still believe there is an open question on how
to fit ARM and AMD's vSVA support into this per-ioasid SVA operation model.
Thoughts?

+-----------------------------+-----------------------------------------------+
|  userspace                  |   kernel space                                |
+-----------------------------+-----------------------------------------------+
| ioasid_fd =                 | /dev/ioasid does below:                       |
| open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {                      |
|                             |       struct list_head ioasid_list;           |
|                             |       ...                                     |
|                             |   } ifd_ctx; // ifd_ctx is per ioasid_fd      |
+-----------------------------+-----------------------------------------------+
| ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
|       ALLOC, &ioasid);      |   struct ioasid_data {                        |
|                             |       ioasid_t ioasid;                        |
|                             |       struct list_head device_list;           |
|                             |       struct list_head next;                  |
|                             |       ...                                     |
|                             |   } id_data; // id_data is per ioasid         |
|                             |                                               |
|                             |   list_add(&id_data.next,                     |
|                             |            &ifd_ctx.ioasid_list);             |
+-----------------------------+-----------------------------------------------+
| ioctl(device_fd,            | VFIO does below:                              |
|       DEVICE_ALLOW_IOASID,  | 1) get ioasid_fd, check if ioasid_fd is valid |
|       ioasid_fd,            | 2) check if ioasid is allocated from ioasid_fd|
|       ioasid);              | 3) register device/domain info to /dev/ioasid |
|                             |    tracked in id_data.device_list             |
|                             | 4) record the ioasid in VFIO's per-device     |
|                             |    ioasid list for future security check      |
+-----------------------------+-----------------------------------------------+

RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Wednesday, March 31, 2021 8:41 PM
> 
> On Wed, Mar 31, 2021 at 07:38:36AM +0000, Liu, Yi L wrote:
> 
> > The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> > the VM should be able to be shared by all assigned device for the VM.
> > But the SVA operations (bind/unbind page table, cache_invalidate) should
> > be per-device.
> 
> It is not *per-device* it is *per-ioasid*
>
> And as /dev/ioasid is an interface for controlling multiple ioasid's
> there is no issue to also multiplex the page table manipulation for
> multiple ioasids as well.
> 
> What you should do next is sketch out in some RFC the exact ioctls
> each FD would have and show how the parts I outlined would work and
> point out any remaining gaps.
> 
> The device FD is something like the vfio_device FD from VFIO, it has
> *nothing* to do with PASID beyond having a single ioctl to authorize
> the device to use the PASID. All control of the PASID is in
> /dev/ioasid.

good to see this reply. Your idea is much clearer to me now. If I'm getting
you correctly, I think the skeleton is something like below:

1) userspace opens a /dev/ioasid; meanwhile an ioasid is allocated and a
   per-ioasid context is created which can be used to do bind page table
   and cache invalidate; an ioasid FD is returned to userspace.
2) userspace passes the ioasid FD to VFIO and lets it be associated with a
   device FD (like the vfio_device FD).
3) userspace binds a page table on the ioasid FD with the page table info.
4) userspace unbinds the page table on the ioasid FD.
5) userspace de-associates the ioasid FD and device FD.

Does above suit your outline?

If yes, I still have below concern and wish to see your opinion.
- the ioasid FD and device association will happen at runtime instead of
  just in the setup phase.
- how about AMD and ARM's vSVA support? Their PASID allocation and page table
  setup happen within the guest. They only need to bind the guest PASID table
  to the host. Above model seems unable to fit them. (Jean, Eric, Jacob please
  feel free to correct me)
- these per-ioasid SVA operations are not aligned with the native SVA usage
  model. Native SVA bind is per-device.

Regards,
Yi Liu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:43 PM
[..]
> No, the mdev device driver must enforce this directly. It is the one
> that programs the physical shared HW, it is the one that needs a list
> of PASID's it is allowed to program *for each mdev*
> 
> ioasid_set doesn't seem to help at all, certainly not as a concept
> tied to /dev/ioasid.
> 

As replied in another thread, we introduced ioasid_set based on the
motivation to have per-VM ioasid tracking, which is required when userspace
tries to bind an ioasid with a device: we should ensure the ioasid it is
using was allocated to it, otherwise we may suffer inter-VM ioasid problems.
It doesn't necessarily have to be ioasid_set, but per-VM ioasid tracking is
necessary.

Regards,
Yi Liu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:28 PM
> 
> On Tue, Mar 30, 2021 at 04:14:58AM +0000, Tian, Kevin wrote:
> 
> > One correction. The mdev should still construct the list of allowed
> > PASID's as you said (by listening to IOASID_BIND/UNBIND event), in
> > addition to the ioasid set maintained per VM (updated when a PASID is
> > allocated/freed). The per-VM set is required for inter-VM isolation
> > (verified when a pgtable is bound to the mdev/PASID), while the mdev's
> > own list is necessary for intra-VM isolation when multiple mdevs are
> > assigned to the same VM (verified before loading a PASID to the mdev).
> > This series just handles the general part, i.e. the per-VM ioasid set,
> > and leaves the mdev's own list to be managed by the specific mdev driver
> > which listens to various IOASID events.
> 
> This is better, but I don't understand why we need such a convoluted
> design.
> 
> Get rid of the ioasid set.
>
> Each driver has its own list of allowed ioasids.

First, I agree with you that it's necessary to have a per-device allowed
ioasid list. But besides that, I think we still need to ensure the ioasid
used by a VM is really allocated to this VM; a VM should not use an ioasid
allocated to another VM, right? Actually, this is the major intention for
introducing ioasid_set.

> Register a ioasid in the driver's list by passing the fd and ioasid #

The fd here is a device fd, am I right? If yes, your idea is that an ioasid
is allocated via /dev/ioasid and associated with a device fd via either a
VFIO or vDPA ioctl, right? Sorry, I may be asking silly questions, but I
really need to ensure we are on the same page.

> No listening to events. A simple understandable security model.

For this suggestion, I have a bit of concern that we may have an A-B/B-A
lock ordering issue, since it requires /dev/ioasid (if it supports this) to
call back into VFIO/vDPA to check whether the ioasid has been registered to
the device FD and record it in the per-device list, right? Let's have more
discussion based on the skeleton sent by Kevin.

Regards,
Yi Liu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
Hi Jason,

> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:29 PM
> 
> On Tue, Mar 30, 2021 at 01:37:05AM +0000, Tian, Kevin wrote:
[...]
> > Hi, Jason,
> >
> > Actually above is a major open while we are refactoring vSVA uAPI toward
> > this direction. We have two concerns about merging /dev/ioasid with
> > /dev/sva, and would like to hear your thought whether they are valid.
> >
> > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > bound to specific security context (e.g. a control vq in vDPA) instead of
> > tying to mm. In this case there is no pgtable binding initiated from user
> > space. Instead, ioasid is allocated from /dev/ioasid and then programmed
> > to the intended security context through specific passthrough framework
> > which manages that context.
> 
> This sounds like the exact opposite of what I'd like to see.
> 
> I do not want to see every subsystem gaining APIs to program a
> PASID. All of that should be consolidated in *one place*.
> 
> I do not want to see VDPA and VFIO have two nearly identical sets of
> APIs to control the PASID.
> 
> Drivers consuming a PASID, like VDPA, should consume the PASID and do
> nothing more than authorize the HW to use it.
> 
> qemu should have general code under the viommu driver that drives
> /dev/ioasid to create PASID's and manage the IO mapping according to
> the guest's needs.
> 
> Drivers like VDPA and VFIO should simply accept that PASID and
> configure/authorize their HW to do DMA's with its tag.
> 
> > Second, ioasid is managed per process/VM while pgtable binding is a
> > device-wise operation.  The userspace flow looks like below for an integral
> > /dev/ioasid interface:
> >
> > - ioctl(container->fd, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU)
> > - ioasid_fd = open(/dev/ioasid)
> > - ioctl(ioasid_fd, IOASID_GET_USVA_FD, &sva_fd) //an empty context
> > - ioctl(device->fd, VFIO_DEVICE_SET_SVA, &sva_fd); //sva_fd ties to device
> > - ioctl(sva_fd, USVA_GET_INFO, &sva_info);
> > - ioctl(ioasid_fd, IOMMU_ALLOC_IOASID, &ioasid);
> > - ioctl(sva_fd, USVA_BIND_PGTBL, &bind_data);
> > - ioctl(sva_fd, USVA_FLUSH_CACHE, &inv_info);
> > - ioctl(sva_fd, USVA_UNBIND_PGTBL, &unbind_data);
> > - ioctl(device->fd, VFIO_DEVICE_UNSET_SVA, &sva_fd);
> > - close(sva_fd)
> > - close(ioasid_fd)
> >
> > Our hesitation here is based on one of your earlier comments that
> > you are not a fan of constructing fd's through ioctl. Are you OK with
> > above flow or have a better idea of handling it?
> 
> My reaction is to squash 'sva' and ioasid fds together, I can't see
> why you'd need two fds to manipulate a PASID.

The reason is that the /dev/ioasid FD is per-VM, since the ioasid allocated
to the VM should be able to be shared by all assigned devices for the VM,
but the SVA operations (bind/unbind page table, cache_invalidate) should be
per-device. If the two fds are squashed into one, then each vSVA ioctl
requires a device tag; I'm not sure that is good. To me, it looks better to
have an SVA FD associated with a device FD so that any ioctl on it will be
at the device level. This also benefits ARM and AMD's vSVA support, since
they bind the guest PASID table to the host instead of binding guest page
tables to specific PASIDs.

Regards,
Yi Liu


RE: [PATCH V4 00/18] IOASID extensions for guest SVA

2021-03-02 Thread Liu, Yi L
> From: Jacob Pan 
> Sent: Sunday, February 28, 2021 6:01 AM
>
> I/O Address Space ID (IOASID) core code was introduced in v5.5 as a generic
> kernel allocator service for both PCIe Process Address Space ID (PASID) and
> ARM SMMU's Substream ID. IOASIDs are used to associate DMA requests with
> virtual address spaces, including both host and guest.
> 
> In addition to providing basic ID allocation, ioasid_set was defined as a
> token that is shared by a group of IOASIDs. This set token can be used
> for permission checking, but lack some features to address the following
> needs by guest Shared Virtual Address (SVA).
> - Manage IOASIDs by group, group ownership, quota, etc.
> - State synchronization among IOASID users (e.g. IOMMU driver, KVM,
>   device drivers)
> - Non-identity guest-host IOASID mapping
> - Lifecycle management
> 
> This patchset introduces the following extensions as solutions to the
> problems above.
> - Redefine and extend IOASID set such that IOASIDs can be managed by
>   groups/pools.
> - Add notifications for IOASID state synchronization
> - Extend reference counting for life cycle alignment among multiple users
> - Support ioasid_set private IDs, which can be used as guest IOASIDs
> - Add a new cgroup controller for resource distribution
> 
> Please refer to Documentation/admin-guide/cgroup-v1/ioasids.rst and
> Documentation/driver-api/ioasid.rst in the enclosed patches for more
> details.
> 
> Based on discussions on LKML[1], a direction change was made in v4 such
> that the user interfaces for IOASID allocation are extracted from the VFIO
> subsystem. The proposed IOASID subsystem now consists of three components:
> 1. IOASID core[01-14]: provides APIs for allocation, pool management,
>    notifications, and refcounting.
> 2. IOASID cgroup controller[RFC 15-17]: manage resource distribution[2].
> 3. IOASID user[RFC 18]: provides user allocation interface via /dev/ioasid
> 
> This patchset only included VT-d driver as users of some of the new APIs.
> VFIO and KVM patches are coming up to fully utilize the APIs introduced
> here.
>
> [1] https://lore.kernel.org/linux-iommu/1599734733-6431-1-git-send-email-
> yi.l@intel.com/
> [2] Note that ioasid quota management code can be removed once the
> IOASIDs cgroup is ratified.
> 
> You can find this series, VFIO, KVM, and IOASID user at:
> https://github.com/jacobpan/linux.git ioasid_v4
> (VFIO and KVM patches will be available at this branch when published.)

VFIO and QEMU series are listed below:

VFIO: 
https://lore.kernel.org/linux-iommu/20210302203545.436623-1-yi.l@intel.com/
QEMU: 
https://lore.kernel.org/qemu-devel/20210302203827.437645-1-yi.l@intel.com/T/#t

Regards,
Yi Liu



RE: [RFC PATCH v1 0/4] vfio: Add IOPF support for VFIO passthrough

2021-02-09 Thread Liu, Yi L
> From: Tian, Kevin 
> Sent: Thursday, February 4, 2021 2:52 PM
> 
> > From: Shenming Lu 
> > Sent: Tuesday, February 2, 2021 2:42 PM
> >
> > On 2021/2/1 15:56, Tian, Kevin wrote:
> > >> From: Alex Williamson 
> > >> Sent: Saturday, January 30, 2021 6:58 AM
> > >>
> > >> On Mon, 25 Jan 2021 17:03:58 +0800
> > >> Shenming Lu  wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> The static pinning and mapping problem in VFIO and possible solutions
> > >>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> > >>> page fault support for VFIO devices. Different from those relatively
> > >>> complicated software approaches such as presenting a vIOMMU that
> > >>> provides the DMA buffer information (might include para-virtualized
> > >>> optimizations), IOPF mainly depends on the hardware faulting
> > >>> capability, such as the PCIe PRI extension or Arm SMMU stall model.
> > >>> What's more, the IOPF support in the IOMMU driver is being implemented
> > >>> in SVA [3]. So do we consider to add IOPF support for VFIO passthrough
> > >>> based on the IOPF part of SVA at present?
> > >>>
> > >>> We have implemented a basic demo only for one stage of translation
> > >>> (GPA -> HPA in virtualization, note that it can be configured at
> > >>> either stage), and tested on a Hisilicon Kunpeng920 board. The nested
> > >>> mode is more complicated since VFIO only handles the second stage page
> > >>> faults (same as the non-nested case), while the first stage page
> > >>> faults need to be further delivered to the guest, which is being
> > >>> implemented in [4] on ARM. My thought on this is to report the page
> > >>> faults to VFIO regardless of the occurred stage (try to carry the
> > >>> stage information), and handle respectively according to the
> > >>> configured mode in VFIO. Or the IOMMU driver might evolve to support
> > >>> more...
> > >>>
> > >>> Might TODO:
> > >>>  - Optimize the faulting path, and measure the performance (it might
> > >>>    still be a big issue).
> > >>>  - Add support for PRI.
> > >>>  - Add a MMU notifier to avoid pinning.
> > >>>  - Add support for the nested mode.
> > >>> ...
> > >>>
> > >>> Any comments and suggestions are very welcome. :-)
> > >>
> > >> I expect performance to be pretty bad here, the lookup involved per
> > >> fault is excessive.  There are cases where a user is not going to be
> > >> willing to have a slow ramp up of performance for their devices as they
> > >> fault in pages, so we might need to consider making this
> > >> configurable through the vfio interface.  Our page mapping also only
> > >
> > > There is another factor to be considered. The presence of IOMMU_
> > > DEV_FEAT_IOPF just indicates the device capability of triggering I/O
> > > page fault through the IOMMU, but does not exactly mean that the device
> > > can tolerate I/O page fault for arbitrary DMA requests.
> >
> > Yes, so I add an iopf_enabled field in VFIO to indicate the whole-path
> > faulting capability and set it to true after registering a VFIO page
> > fault handler.
> >
> > > In reality, many
> > > devices allow I/O faulting only in selective contexts. However, there
> > > is no standard way (e.g. PCISIG) for the device to report whether
> > > arbitrary I/O fault is allowed. Then we may have to maintain device
> > > specific knowledge in software, e.g. in an opt-in table to list devices
> > > which allow arbitrary faults. For devices which only support selective
> > > faulting, a mediator (either through vendor extensions on vfio-pci-core
> > > or a mdev wrapper) might be necessary to help lock down non-faultable
> > > mappings and then enable faulting on the rest of the mappings.
> >
> > For devices which only support selective faulting, they could tell it to
> > the IOMMU driver and let it filter out non-faultable faults? Do I get it
> > wrong?
> 
> Not exactly to the IOMMU driver. There is already a vfio_pin_pages() for
> selective page-pinning. The matter is that 'they' imply some device
> specific logic to decide which pages must be pinned and such knowledge
> is outside of VFIO.
> 
> From an enabling p.o.v we could possibly do it in a phased approach. First
> handle devices which tolerate arbitrary DMA faults, and then extend to
> devices with selective-faulting. The former is simpler, but with one main
> open: whether we want to maintain such device IDs in a static table in
> VFIO or rely on some hints from other components (e.g. the PF driver in
> the VF assignment case). Let's see how Alex thinks about it.
> 
> >
> > >
> > >> grows here, should mappings expire or do we need a least recently
> > >> mapped tracker to avoid exceeding the user's locked memory limit?  How
> > >> does a user know what to set for a locked memory limit?  The behavior
> > >> here would lead to cases where an idle system might be ok, but as soon
> > >> as load increases with more inflight DMA, we start seeing
> > >> "unpredictable" I/O faults from the user 

[PATCH v4 3/3] iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

2021-01-06 Thread Liu Yi L
iommu_flush_dev_iotlb() is called to invalidate caches on a device, but it
only loops over the devices which are fully attached to the domain. For
sub-devices, this is ineffective and results in invalid caching entries
left on the device. Fix it by adding a loop over subdevices as well. Also,
domain->has_iotlb_device needs to be updated when attaching to subdevices.

Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain attach/detach")
Signed-off-by: Liu Yi L 
Acked-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 53 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index d7720a8..65cf06d 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -719,6 +719,8 @@ static int domain_update_device_node(struct dmar_domain 
*domain)
return nid;
 }
 
+static void domain_update_iotlb(struct dmar_domain *domain);
+
 /* Some capabilities may be different across iommus */
 static void domain_update_iommu_cap(struct dmar_domain *domain)
 {
@@ -744,6 +746,8 @@ static void domain_update_iommu_cap(struct dmar_domain 
*domain)
domain->domain.geometry.aperture_end = 
__DOMAIN_MAX_ADDR(domain->gaw - 1);
else
domain->domain.geometry.aperture_end = 
__DOMAIN_MAX_ADDR(domain->gaw);
+
+   domain_update_iotlb(domain);
 }
 
 struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8 bus,
@@ -1464,17 +1468,22 @@ static void domain_update_iotlb(struct dmar_domain 
*domain)
 
assert_spin_locked(&device_domain_lock);
 
-   list_for_each_entry(info, &domain->devices, link) {
-   struct pci_dev *pdev;
-
-   if (!info->dev || !dev_is_pci(info->dev))
-   continue;
-
-   pdev = to_pci_dev(info->dev);
-   if (pdev->ats_enabled) {
+   list_for_each_entry(info, &domain->devices, link)
+   if (info->ats_enabled) {
has_iotlb_device = true;
break;
}
+
+   if (!has_iotlb_device) {
+   struct subdev_domain_info *sinfo;
+
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
+   info = get_domain_info(sinfo->pdev);
+   if (info && info->ats_enabled) {
+   has_iotlb_device = true;
+   break;
+   }
+   }
}
 
domain->has_iotlb_device = has_iotlb_device;
@@ -1555,25 +1564,37 @@ static void iommu_disable_dev_iotlb(struct 
device_domain_info *info)
 #endif
 }
 
+static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
+   u64 addr, unsigned int mask)
+{
+   u16 sid, qdep;
+
+   if (!info || !info->ats_enabled)
+   return;
+
+   sid = info->bus << 8 | info->devfn;
+   qdep = info->ats_qdep;
+   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
+  qdep, addr, mask);
+}
+
 static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
  u64 addr, unsigned mask)
 {
-   u16 sid, qdep;
unsigned long flags;
struct device_domain_info *info;
+   struct subdev_domain_info *sinfo;
 
if (!domain->has_iotlb_device)
return;
 
spin_lock_irqsave(&device_domain_lock, flags);
-   list_for_each_entry(info, &domain->devices, link) {
-   if (!info->ats_enabled)
-   continue;
+   list_for_each_entry(info, &domain->devices, link)
+   __iommu_flush_dev_iotlb(info, addr, mask);
 
-   sid = info->bus << 8 | info->devfn;
-   qdep = info->ats_qdep;
-   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
-   qdep, addr, mask);
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
+   info = get_domain_info(sinfo->pdev);
+   __iommu_flush_dev_iotlb(info, addr, mask);
}
spin_unlock_irqrestore(&device_domain_lock, flags);
 }
-- 
2.7.4



[PATCH v4 2/3] iommu/vt-d: Track device aux-attach with subdevice_domain_info

2021-01-06 Thread Liu Yi L
In the existing code, looping over all devices attached to a domain does
not include sub-devices attached via iommu_aux_attach_device().

This was found while working on the patch below: there is no device in
the domain->devices list, so the cap and ecap of the iommu unit cannot
be retrieved. Yet the domain actually has a subdevice attached in aux
manner, which is not tracked by the domain. This patch fixes that.

https://lore.kernel.org/kvm/1599734733-6431-17-git-send-email-yi.l@intel.com/

And this fix goes beyond the patch above: such sub-device tracking is
necessary for other cases as well, e.g. flushing the device IOTLB for a
domain which has sub-devices attached in the auxiliary manner.

Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain attach/detach")
Co-developed-by: Xin Zeng 
Signed-off-by: Xin Zeng 
Signed-off-by: Liu Yi L 
Acked-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 95 +
 include/linux/intel-iommu.h | 16 +---
 2 files changed, 82 insertions(+), 29 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 788119c..d7720a8 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1877,6 +1877,7 @@ static struct dmar_domain *alloc_domain(int flags)
domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
domain->has_iotlb_device = false;
INIT_LIST_HEAD(&domain->devices);
+   INIT_LIST_HEAD(&domain->subdevices);
 
return domain;
 }
@@ -2547,7 +2548,7 @@ static struct dmar_domain 
*dmar_insert_one_dev_info(struct intel_iommu *iommu,
info->iommu = iommu;
info->pasid_table = NULL;
info->auxd_enabled = 0;
-   INIT_LIST_HEAD(&info->auxiliary_domains);
+   INIT_LIST_HEAD(&info->subdevices);
 
if (dev && dev_is_pci(dev)) {
struct pci_dev *pdev = to_pci_dev(info->dev);
@@ -4475,33 +4476,61 @@ is_aux_domain(struct device *dev, struct iommu_domain 
*domain)
domain->type == IOMMU_DOMAIN_UNMANAGED;
 }
 
-static void auxiliary_link_device(struct dmar_domain *domain,
- struct device *dev)
+static inline struct subdev_domain_info *
+lookup_subdev_info(struct dmar_domain *domain, struct device *dev)
+{
+   struct subdev_domain_info *sinfo;
+
+   if (!list_empty(&domain->subdevices)) {
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
+   if (sinfo->pdev == dev)
+   return sinfo;
+   }
+   }
+
+   return NULL;
+}
+
+static int auxiliary_link_device(struct dmar_domain *domain,
+struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
 
assert_spin_locked(&device_domain_lock);
if (WARN_ON(!info))
-   return;
+   return -EINVAL;
+
+   if (!sinfo) {
+   sinfo = kzalloc(sizeof(*sinfo), GFP_ATOMIC);
+   sinfo->domain = domain;
+   sinfo->pdev = dev;
+   list_add(&sinfo->link_phys, &info->subdevices);
+   list_add(&sinfo->link_domain, &domain->subdevices);
+   }
 
-   domain->auxd_refcnt++;
-   list_add(&domain->auxd, &info->auxiliary_domains);
+   return ++sinfo->users;
 }
 
-static void auxiliary_unlink_device(struct dmar_domain *domain,
-   struct device *dev)
+static int auxiliary_unlink_device(struct dmar_domain *domain,
+  struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
+   int ret;
 
assert_spin_locked(&device_domain_lock);
-   if (WARN_ON(!info))
-   return;
+   if (WARN_ON(!info || !sinfo || sinfo->users <= 0))
+   return -EINVAL;
 
-   list_del(&domain->auxd);
-   domain->auxd_refcnt--;
+   ret = --sinfo->users;
+   if (!ret) {
+   list_del(&sinfo->link_phys);
+   list_del(&sinfo->link_domain);
+   kfree(sinfo);
+   }
 
-   if (!domain->auxd_refcnt && domain->default_pasid > 0)
-   ioasid_put(domain->default_pasid);
+   return ret;
 }
 
 static int aux_domain_add_dev(struct dmar_domain *domain,
@@ -4530,6 +4559,19 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
}
 
spin_lock_irqsave(&device_domain_lock, flags);
+   ret = auxiliary_link_device(domain, dev);
+   if (ret <= 0)
+   goto link_failed;
+
+   /*
+* Subdevices from the same physical device can be attached to the
+* same domain. For such cases, only the first subdevice attachment
+* needs to go through the full steps in this function. So if ret >
+* 1, just goto out.
+*/
+   if (ret > 1)
+   goto out;

[PATCH v4 1/3] iommu/vt-d: Move intel_iommu info from struct intel_svm to struct intel_svm_dev

2021-01-06 Thread Liu Yi L
Current struct intel_svm has a field to record the struct intel_iommu
pointer for a PASID bind, and struct intel_svm is shared by all the
devices bound to the same process. The devices may be behind different
DMAR units. As the iommu driver code uses the intel_iommu pointer stored
in the intel_svm struct to do cache invalidations, it may only flush the
cache on a single DMAR unit; for the others, the cache invalidation is
missed.

As the intel_svm struct already has a device list, this patch just moves
the intel_iommu pointer to be a field of struct intel_svm_dev.

Fixes: 1c4f88b7f1f92 ("iommu/vt-d: Shared virtual address in scalable mode")
Cc: Lu Baolu 
Cc: Jacob Pan 
Cc: Raj Ashok 
Cc: David Woodhouse 
Reported-by: Guo Kaijie 
Reported-by: Xin Zeng 
Signed-off-by: Guo Kaijie 
Signed-off-by: Xin Zeng 
Signed-off-by: Liu Yi L 
Tested-by: Guo Kaijie 
Cc: sta...@vger.kernel.org # v5.0+
Acked-by: Lu Baolu 
---
 drivers/iommu/intel/svm.c   | 9 +
 include/linux/intel-iommu.h | 2 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 4fa248b..6956669 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -142,7 +142,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, &desc, 1, 0);
+   qi_submit_sync(sdev->iommu, &desc, 1, 0);
 
if (sdev->dev_iotlb) {
desc.qw0 = QI_DEV_EIOTLB_PASID(svm->pasid) |
@@ -166,7 +166,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, &desc, 1, 0);
+   qi_submit_sync(sdev->iommu, &desc, 1, 0);
}
 }
 
@@ -211,7 +211,7 @@ static void intel_mm_release(struct mmu_notifier *mn, 
struct mm_struct *mm)
 */
rcu_read_lock();
list_for_each_entry_rcu(sdev, &svm->devs, list)
-   intel_pasid_tear_down_entry(svm->iommu, sdev->dev,
+   intel_pasid_tear_down_entry(sdev->iommu, sdev->dev,
svm->pasid, true);
rcu_read_unlock();
 
@@ -363,6 +363,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, 
struct device *dev,
}
sdev->dev = dev;
sdev->sid = PCI_DEVID(info->bus, info->devfn);
+   sdev->iommu = iommu;
 
/* Only count users if device has aux domains */
if (iommu_dev_feature_enabled(dev, IOMMU_DEV_FEAT_AUX))
@@ -546,6 +547,7 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
goto out;
}
sdev->dev = dev;
+   sdev->iommu = iommu;
 
ret = intel_iommu_enable_pasid(iommu, dev);
if (ret) {
@@ -575,7 +577,6 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
kfree(sdev);
goto out;
}
-   svm->iommu = iommu;
 
if (pasid_max > intel_pasid_max_id)
pasid_max = intel_pasid_max_id;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index d956987..9452268 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -758,6 +758,7 @@ struct intel_svm_dev {
struct list_head list;
struct rcu_head rcu;
struct device *dev;
+   struct intel_iommu *iommu;
struct svm_dev_ops *ops;
struct iommu_sva sva;
u32 pasid;
@@ -771,7 +772,6 @@ struct intel_svm {
struct mmu_notifier notifier;
struct mm_struct *mm;
 
-   struct intel_iommu *iommu;
unsigned int flags;
u32 pasid;
int gpasid; /* In case that guest PASID is different from host PASID */
-- 
2.7.4



[PATCH v4 0/3] iommu/vt-d: Misc fixes on scalable mode

2021-01-06 Thread Liu Yi L
Hi Baolu, Joerg, Will,

This patchset aims to fix a bug regarding native SVM usage, and also
two bugs around subdevice (attached to a device via the auxiliary
manner) tracking and ineffective device-TLB flush.

v3 -> v4:
- Address comments from Baolu Lu and add acked-by
- Fix issue reported by "Dan Carpenter" and "kernel test robot"
- Add tested-by from Guo Kaijie on patch 1/3
- Rebase to 5.11-rc2
v3: 
https://lore.kernel.org/linux-iommu/20201229032513.486395-1-yi.l@intel.com/

v2 -> v3:
- Address comments from Baolu Lu against v2
- Rebased to 5.11-rc1
v2: 
https://lore.kernel.org/linux-iommu/20201223062720.29364-1-yi.l@intel.com/

v1 -> v2:
- Use a more recent Fix tag in "iommu/vt-d: Move intel_iommu info from struct 
intel_svm to struct intel_svm_dev"
- Refined the "iommu/vt-d: Track device aux-attach with subdevice_domain_info"
- Rename "iommu/vt-d: A fix to iommu_flush_dev_iotlb() for aux-domain" to be
  "iommu/vt-d: Fix ineffective devTLB invalidation for subdevices"
- Refined the commit messages
v1: 
https://lore.kernel.org/linux-iommu/2020122352.183523-1-yi.l@intel.com/

Regards,
Yi Liu

Liu Yi L (3):
  iommu/vt-d: Move intel_iommu info from struct intel_svm to struct
intel_svm_dev
  iommu/vt-d: Track device aux-attach with subdevice_domain_info
  iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

 drivers/iommu/intel/iommu.c | 148 
 drivers/iommu/intel/svm.c   |   9 +--
 include/linux/intel-iommu.h |  18 --
 3 files changed, 125 insertions(+), 50 deletions(-)

-- 
2.7.4



RE: [PATCH v3 3/3] iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

2021-01-06 Thread Liu, Yi L
Hi Will,

> From: Will Deacon 
> Sent: Wednesday, January 6, 2021 1:24 AM
> 
> On Tue, Jan 05, 2021 at 05:50:22AM +0000, Liu, Yi L wrote:
> > > > +static void __iommu_flush_dev_iotlb(struct device_domain_info
> *info,
> > > > +   u64 addr, unsigned int mask)
> > > > +{
> > > > +   u16 sid, qdep;
> > > > +
> > > > +   if (!info || !info->ats_enabled)
> > > > +   return;
> > > > +
> > > > +   sid = info->bus << 8 | info->devfn;
> > > > +   qdep = info->ats_qdep;
> > > > +   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
> > > > +  qdep, addr, mask);
> > > > +}
> > > > +
> > > >   static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
> > > >   u64 addr, unsigned mask)
> > > >   {
> > > > -   u16 sid, qdep;
> > > > unsigned long flags;
> > > > struct device_domain_info *info;
> > > > +   struct subdev_domain_info *sinfo;
> > > >
> > > > if (!domain->has_iotlb_device)
> > > > return;
> > > >
> > > > spin_lock_irqsave(&device_domain_lock, flags);
> > > > -   list_for_each_entry(info, &domain->devices, link) {
> > > > -   if (!info->ats_enabled)
> > > > -   continue;
> > > > +   list_for_each_entry(info, &domain->devices, link)
> > > > +   __iommu_flush_dev_iotlb(info, addr, mask);
> > > >
> > > > -   sid = info->bus << 8 | info->devfn;
> > > > -   qdep = info->ats_qdep;
> > > > -   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
> > > > -   qdep, addr, mask);
> > > > +   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
> > > > +   __iommu_flush_dev_iotlb(get_domain_info(sinfo->pdev),
> > > > +   addr, mask);
> > > > }
> > >
> > > Nit:
> > >   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
> > >   info = get_domain_info(sinfo->pdev);
> > >   __iommu_flush_dev_iotlb(info, addr, mask);
> > >   }
> >
> > you are right. this should be better.
> 
> Please can you post a v4, with Lu's acks and the issue reported by Dan fixed
> too?

sure, will send out later.

Regards,
Yi Liu

> Thanks,
> 
> Will


RE: [PATCH v3 2/3] iommu/vt-d: Track device aux-attach with subdevice_domain_info

2021-01-04 Thread Liu, Yi L
Hi Baolu,

> From: Lu Baolu 
> Sent: Tuesday, December 29, 2020 4:38 PM
> 
> Hi Yi,
> 
> On 2020/12/29 11:25, Liu Yi L wrote:
> > In the existing code, loop all devices attached to a domain does not
> > include sub-devices attached via iommu_aux_attach_device().
> >
> > This was found by when I'm working on the belwo patch, There is no
>  ^
> below

nice catch. 

> > device in the domain->devices list, thus unable to get the cap and
> > ecap of iommu unit. But this domain actually has subdevice which is
> > attached via aux-manner. But it is tracked by domain. This patch is
> > going to fix it.
> >
> > https://lore.kernel.org/kvm/1599734733-6431-17-git-send-email-
> yi.l@intel.com/
> >
> > And this fix goes beyond the patch above, such sub-device tracking is
> > necessary for other cases. For example, flushing device_iotlb for a
> > domain which has sub-devices attached by auxiliary manner.
> >
> > Co-developed-by: Xin Zeng 
> > Signed-off-by: Xin Zeng 
> > Signed-off-by: Liu Yi L 
> 
> Others look good to me.
> 
> Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain
> attach/detach")
> Acked-by: Lu Baolu 

thanks,

Regards,
Yi Liu

> Best regards,
> baolu
> 
> > ---
> >   drivers/iommu/intel/iommu.c | 95 +++
> --
> >   include/linux/intel-iommu.h | 16 +--
> >   2 files changed, 82 insertions(+), 29 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 788119c5b021..d7720a836268 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -1877,6 +1877,7 @@ static struct dmar_domain *alloc_domain(int
> flags)
> > domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
> > domain->has_iotlb_device = false;
> > INIT_LIST_HEAD(&domain->devices);
> > +   INIT_LIST_HEAD(&domain->subdevices);
> >
> > return domain;
> >   }
> > @@ -2547,7 +2548,7 @@ static struct dmar_domain
> *dmar_insert_one_dev_info(struct intel_iommu *iommu,
> > info->iommu = iommu;
> > info->pasid_table = NULL;
> > info->auxd_enabled = 0;
> > -   INIT_LIST_HEAD(&info->auxiliary_domains);
> > +   INIT_LIST_HEAD(&info->subdevices);
> >
> > if (dev && dev_is_pci(dev)) {
> > struct pci_dev *pdev = to_pci_dev(info->dev);
> > @@ -4475,33 +4476,61 @@ is_aux_domain(struct device *dev, struct
> iommu_domain *domain)
> > domain->type == IOMMU_DOMAIN_UNMANAGED;
> >   }
> >
> > -static void auxiliary_link_device(struct dmar_domain *domain,
> > - struct device *dev)
> > +static inline struct subdev_domain_info *
> > +lookup_subdev_info(struct dmar_domain *domain, struct device *dev)
> > +{
> > +   struct subdev_domain_info *sinfo;
> > +
> > +   if (!list_empty(&domain->subdevices)) {
> > +   list_for_each_entry(sinfo, &domain->subdevices,
> link_domain) {
> > +   if (sinfo->pdev == dev)
> > +   return sinfo;
> > +   }
> > +   }
> > +
> > +   return NULL;
> > +}
> > +
> > +static int auxiliary_link_device(struct dmar_domain *domain,
> > +struct device *dev)
> >   {
> > struct device_domain_info *info = get_domain_info(dev);
> > +   struct subdev_domain_info *sinfo = lookup_subdev_info(domain,
> dev);
> >
> > assert_spin_locked(&device_domain_lock);
> > if (WARN_ON(!info))
> > -   return;
> > +   return -EINVAL;
> > +
> > +   if (!sinfo) {
> > +   sinfo = kzalloc(sizeof(*sinfo), GFP_ATOMIC);
> > +   sinfo->domain = domain;
> > +   sinfo->pdev = dev;
> > +   list_add(&sinfo->link_phys, &info->subdevices);
> > +   list_add(&sinfo->link_domain, &domain->subdevices);
> > +   }
> >
> > -   domain->auxd_refcnt++;
> > -   list_add(&domain->auxd, &info->auxiliary_domains);
> > +   return ++sinfo->users;
> >   }
> >
> > -static void auxiliary_unlink_device(struct dmar_domain *domain,
> > -   struct device *dev)
> > +static int auxiliary_unlink_device(struct dmar_domain *domain,
> > +  struct device *dev)
> >   {
> > struct device_domain_info *info = get_domain_info(dev);
> > +   struct subdev_domain_info *sinfo = lookup_subdev_info(domain,

RE: [PATCH v3 3/3] iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

2021-01-04 Thread Liu, Yi L
Hi Baolu,

> From: Lu Baolu 
> Sent: Tuesday, December 29, 2020 4:42 PM
> 
> Hi Yi,
> 
> On 2020/12/29 11:25, Liu Yi L wrote:
> > iommu_flush_dev_iotlb() is called to invalidate caches on device. It only
> > loops the devices which are full-attached to the domain. For sub-devices,
> > this is ineffective. This results in invalid caching entries left on the
> > device. Fix it by adding loop for subdevices as well. Also, the domain->
> > has_iotlb_device needs to be updated when attaching to subdevices.
> >
> > Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain
> attach/detach")
> > Signed-off-by: Liu Yi L 
> > ---
> >   drivers/iommu/intel/iommu.c | 53 ++-
> --
> >   1 file changed, 37 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index d7720a836268..d48a60b61ba6 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -719,6 +719,8 @@ static int domain_update_device_node(struct
> dmar_domain *domain)
> > return nid;
> >   }
> >
> > +static void domain_update_iotlb(struct dmar_domain *domain);
> > +
> >   /* Some capabilities may be different across iommus */
> >   static void domain_update_iommu_cap(struct dmar_domain *domain)
> >   {
> > @@ -744,6 +746,8 @@ static void domain_update_iommu_cap(struct
> dmar_domain *domain)
> > domain->domain.geometry.aperture_end =
> __DOMAIN_MAX_ADDR(domain->gaw - 1);
> > else
> > domain->domain.geometry.aperture_end =
> __DOMAIN_MAX_ADDR(domain->gaw);
> > +
> > +   domain_update_iotlb(domain);
> >   }
> >
> >   struct context_entry *iommu_context_addr(struct intel_iommu *iommu,
> u8 bus,
> > @@ -1464,17 +1468,22 @@ static void domain_update_iotlb(struct
> dmar_domain *domain)
> >
> > assert_spin_locked(&device_domain_lock);
> >
> > -   list_for_each_entry(info, &domain->devices, link) {
> > -   struct pci_dev *pdev;
> > -
> > -   if (!info->dev || !dev_is_pci(info->dev))
> > -   continue;
> > -
> > -   pdev = to_pci_dev(info->dev);
> > -   if (pdev->ats_enabled) {
> > +   list_for_each_entry(info, &domain->devices, link)
> > +   if (info && info->ats_enabled) {
> > has_iotlb_device = true;
> > break;
> > }
> > +
> > +   if (!has_iotlb_device) {
> > +   struct subdev_domain_info *sinfo;
> > +
> > +   list_for_each_entry(sinfo, &domain->subdevices,
> link_domain) {
> > +   info = get_domain_info(sinfo->pdev);
> > +   if (info && info->ats_enabled) {
> > +   has_iotlb_device = true;
> > +   break;
> > +   }
> > +   }
> > }
> >
> > domain->has_iotlb_device = has_iotlb_device;
> > @@ -1555,25 +1564,37 @@ static void iommu_disable_dev_iotlb(struct
> device_domain_info *info)
> >   #endif
> >   }
> >
> > +static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
> > +   u64 addr, unsigned int mask)
> > +{
> > +   u16 sid, qdep;
> > +
> > +   if (!info || !info->ats_enabled)
> > +   return;
> > +
> > +   sid = info->bus << 8 | info->devfn;
> > +   qdep = info->ats_qdep;
> > +   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
> > +  qdep, addr, mask);
> > +}
> > +
> >   static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
> >   u64 addr, unsigned mask)
> >   {
> > -   u16 sid, qdep;
> > unsigned long flags;
> > struct device_domain_info *info;
> > +   struct subdev_domain_info *sinfo;
> >
> > if (!domain->has_iotlb_device)
> > return;
> >
> > spin_lock_irqsave(&device_domain_lock, flags);
> > -   list_for_each_entry(info, &domain->devices, link) {
> > -   if (!info->ats_enabled)
> > -   continue;
> > +   list_for_each_entry(info, &domain->devices, link)
> > +   __iommu_flush_dev_iotlb(info, addr, mask);
> >
> > -   sid = info->bus << 8 | info->devfn;
> > -   qdep = info->ats_qdep;
> > -   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
> > -   qdep, addr, mask);
> > +   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
> > +   __iommu_flush_dev_iotlb(get_domain_info(sinfo->pdev),
> > +   addr, mask);
> > }
> 
> Nit:
>   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
>   info = get_domain_info(sinfo->pdev);
>   __iommu_flush_dev_iotlb(info, addr, mask);
>   }

you are right. this should be better.

> Others look good to me.
>
> Acked-by: Lu Baolu 
> 
> Best regards,
> baolu

Regards,
Yi Liu

> > spin_unlock_irqrestore(&device_domain_lock, flags);
> >   }
> >


[PATCH v3 3/3] iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

2020-12-28 Thread Liu Yi L
iommu_flush_dev_iotlb() is called to invalidate caches on a device. It only
loops over the devices which are fully attached to the domain. For
sub-devices this is ineffective, leaving stale caching entries on the
device. Fix it by looping over the subdevices as well. Also, the
domain->has_iotlb_device flag needs to be updated when attaching to
subdevices.

Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain attach/detach")
Signed-off-by: Liu Yi L 
---
 drivers/iommu/intel/iommu.c | 53 ++---
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index d7720a836268..d48a60b61ba6 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -719,6 +719,8 @@ static int domain_update_device_node(struct dmar_domain 
*domain)
return nid;
 }
 
+static void domain_update_iotlb(struct dmar_domain *domain);
+
 /* Some capabilities may be different across iommus */
 static void domain_update_iommu_cap(struct dmar_domain *domain)
 {
@@ -744,6 +746,8 @@ static void domain_update_iommu_cap(struct dmar_domain 
*domain)
domain->domain.geometry.aperture_end = 
__DOMAIN_MAX_ADDR(domain->gaw - 1);
else
domain->domain.geometry.aperture_end = 
__DOMAIN_MAX_ADDR(domain->gaw);
+
+   domain_update_iotlb(domain);
 }
 
 struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8 bus,
@@ -1464,17 +1468,22 @@ static void domain_update_iotlb(struct dmar_domain 
*domain)
 
assert_spin_locked(&device_domain_lock);
 
-   list_for_each_entry(info, &domain->devices, link) {
-   struct pci_dev *pdev;
-
-   if (!info->dev || !dev_is_pci(info->dev))
-   continue;
-
-   pdev = to_pci_dev(info->dev);
-   if (pdev->ats_enabled) {
+   list_for_each_entry(info, &domain->devices, link)
+   if (info && info->ats_enabled) {
has_iotlb_device = true;
break;
}
+
+   if (!has_iotlb_device) {
+   struct subdev_domain_info *sinfo;
+
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
+   info = get_domain_info(sinfo->pdev);
+   if (info && info->ats_enabled) {
+   has_iotlb_device = true;
+   break;
+   }
+   }
}
 
domain->has_iotlb_device = has_iotlb_device;
@@ -1555,25 +1564,37 @@ static void iommu_disable_dev_iotlb(struct 
device_domain_info *info)
 #endif
 }
 
+static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
+   u64 addr, unsigned int mask)
+{
+   u16 sid, qdep;
+
+   if (!info || !info->ats_enabled)
+   return;
+
+   sid = info->bus << 8 | info->devfn;
+   qdep = info->ats_qdep;
+   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
+  qdep, addr, mask);
+}
+
 static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
  u64 addr, unsigned mask)
 {
-   u16 sid, qdep;
unsigned long flags;
struct device_domain_info *info;
+   struct subdev_domain_info *sinfo;
 
if (!domain->has_iotlb_device)
return;
 
spin_lock_irqsave(&device_domain_lock, flags);
-   list_for_each_entry(info, &domain->devices, link) {
-   if (!info->ats_enabled)
-   continue;
+   list_for_each_entry(info, &domain->devices, link)
+   __iommu_flush_dev_iotlb(info, addr, mask);
 
-   sid = info->bus << 8 | info->devfn;
-   qdep = info->ats_qdep;
-   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
-   qdep, addr, mask);
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
+   __iommu_flush_dev_iotlb(get_domain_info(sinfo->pdev),
+   addr, mask);
}
spin_unlock_irqrestore(&device_domain_lock, flags);
 }
-- 
2.25.1



[PATCH v3 2/3] iommu/vt-d: Track device aux-attach with subdevice_domain_info

2020-12-28 Thread Liu Yi L
In the existing code, looping over all devices attached to a domain does
not include sub-devices attached via iommu_aux_attach_device().

This was found while working on the patch below: there is no device in
the domain->devices list, so the cap and ecap of the iommu unit cannot
be retrieved. Yet the domain actually has a subdevice attached in aux
manner, which is not tracked by the domain. This patch fixes that.

https://lore.kernel.org/kvm/1599734733-6431-17-git-send-email-yi.l@intel.com/

And this fix goes beyond the patch above: such sub-device tracking is
necessary for other cases as well, e.g. flushing the device IOTLB for a
domain which has sub-devices attached in the auxiliary manner.

Co-developed-by: Xin Zeng 
Signed-off-by: Xin Zeng 
Signed-off-by: Liu Yi L 
---
 drivers/iommu/intel/iommu.c | 95 +++--
 include/linux/intel-iommu.h | 16 +--
 2 files changed, 82 insertions(+), 29 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 788119c5b021..d7720a836268 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1877,6 +1877,7 @@ static struct dmar_domain *alloc_domain(int flags)
domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
domain->has_iotlb_device = false;
INIT_LIST_HEAD(&domain->devices);
+   INIT_LIST_HEAD(&domain->subdevices);
 
return domain;
 }
@@ -2547,7 +2548,7 @@ static struct dmar_domain 
*dmar_insert_one_dev_info(struct intel_iommu *iommu,
info->iommu = iommu;
info->pasid_table = NULL;
info->auxd_enabled = 0;
-   INIT_LIST_HEAD(&info->auxiliary_domains);
+   INIT_LIST_HEAD(&info->subdevices);
 
if (dev && dev_is_pci(dev)) {
struct pci_dev *pdev = to_pci_dev(info->dev);
@@ -4475,33 +4476,61 @@ is_aux_domain(struct device *dev, struct iommu_domain 
*domain)
domain->type == IOMMU_DOMAIN_UNMANAGED;
 }
 
-static void auxiliary_link_device(struct dmar_domain *domain,
- struct device *dev)
+static inline struct subdev_domain_info *
+lookup_subdev_info(struct dmar_domain *domain, struct device *dev)
+{
+   struct subdev_domain_info *sinfo;
+
+   if (!list_empty(&domain->subdevices)) {
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
+   if (sinfo->pdev == dev)
+   return sinfo;
+   }
+   }
+
+   return NULL;
+}
+
+static int auxiliary_link_device(struct dmar_domain *domain,
+struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
 
assert_spin_locked(&device_domain_lock);
if (WARN_ON(!info))
-   return;
+   return -EINVAL;
+
+   if (!sinfo) {
+   sinfo = kzalloc(sizeof(*sinfo), GFP_ATOMIC);
+   sinfo->domain = domain;
+   sinfo->pdev = dev;
+   list_add(&sinfo->link_phys, &info->subdevices);
+   list_add(&sinfo->link_domain, &domain->subdevices);
+   }
 
-   domain->auxd_refcnt++;
-   list_add(&domain->auxd, &info->auxiliary_domains);
+   return ++sinfo->users;
 }
 
-static void auxiliary_unlink_device(struct dmar_domain *domain,
-   struct device *dev)
+static int auxiliary_unlink_device(struct dmar_domain *domain,
+  struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
+   int ret;
 
assert_spin_locked(&device_domain_lock);
-   if (WARN_ON(!info))
-   return;
+   if (WARN_ON(!info || !sinfo || sinfo->users <= 0))
+   return -EINVAL;
 
-   list_del(&domain->auxd);
-   domain->auxd_refcnt--;
+   ret = --sinfo->users;
+   if (!ret) {
+   list_del(&sinfo->link_phys);
+   list_del(&sinfo->link_domain);
+   kfree(sinfo);
+   }
 
-   if (!domain->auxd_refcnt && domain->default_pasid > 0)
-   ioasid_put(domain->default_pasid);
+   return ret;
 }
 
 static int aux_domain_add_dev(struct dmar_domain *domain,
@@ -4530,6 +4559,19 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
}
 
spin_lock_irqsave(&device_domain_lock, flags);
+   ret = auxiliary_link_device(domain, dev);
+   if (ret <= 0)
+   goto link_failed;
+
+   /*
+* Subdevices from the same physical device can be attached to the
+* same domain. For such cases, only the first subdevice attachment
+* needs to go through the full steps in this function. So if ret >
+* 1, just goto out.
+*/
+   if (ret > 1)
+   goto out;
+
/*
 * iomm

[PATCH v3 0/3] iommu/vt-d: Misc fixes on scalable mode

2020-12-28 Thread Liu Yi L
Hi Baolu, Joerg, Will,

This patchset aims to fix a bug regarding native SVM usage, and also
several bugs around subdevice (attached to a device via the auxiliary
manner) tracking and ineffective device-TLB flush.

v2 -> v3:
- Address comments from Baolu Lu against v2
- Rebased to 5.11-rc1
v2: 
https://lore.kernel.org/linux-iommu/20201223062720.29364-1-yi.l@intel.com/

v1 -> v2:
- Use a more recent Fix tag in "iommu/vt-d: Move intel_iommu info from struct 
intel_svm to struct intel_svm_dev"
- Refined the "iommu/vt-d: Track device aux-attach with subdevice_domain_info"
- Rename "iommu/vt-d: A fix to iommu_flush_dev_iotlb() for aux-domain" to be
  "iommu/vt-d: Fix ineffective devTLB invalidation for subdevices"
- Refined the commit messages
v1: 
https://lore.kernel.org/linux-iommu/2020122352.183523-1-yi.l@intel.com/

Liu Yi L (3):
  iommu/vt-d: Move intel_iommu info from struct intel_svm to struct
intel_svm_dev
  iommu/vt-d: Track device aux-attach with subdevice_domain_info
  iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

 drivers/iommu/intel/iommu.c | 148 ++--
 drivers/iommu/intel/svm.c   |   9 ++-
 include/linux/intel-iommu.h |  18 +++--
 3 files changed, 125 insertions(+), 50 deletions(-)

-- 
2.25.1



[PATCH v3 1/3] iommu/vt-d: Move intel_iommu info from struct intel_svm to struct intel_svm_dev

2020-12-28 Thread Liu Yi L
Current struct intel_svm has a field to record the struct intel_iommu
pointer for a PASID bind, and struct intel_svm is shared by all the
devices bound to the same process. The devices may be behind different
DMAR units. As the iommu driver code uses the intel_iommu pointer stored
in the intel_svm struct to do cache invalidations, it may only flush the
cache on a single DMAR unit; for the others, the cache invalidation is
missed.

As the intel_svm struct already has a device list, this patch just moves
the intel_iommu pointer to be a field of struct intel_svm_dev.

Fixes: 1c4f88b7f1f92 ("iommu/vt-d: Shared virtual address in scalable mode")
Cc: Lu Baolu 
Cc: Jacob Pan 
Cc: Raj Ashok 
Cc: David Woodhouse 
Reported-by: Guo Kaijie 
Reported-by: Xin Zeng 
Signed-off-by: Guo Kaijie 
Signed-off-by: Xin Zeng 
Signed-off-by: Liu Yi L 
---
 drivers/iommu/intel/svm.c   | 9 +
 include/linux/intel-iommu.h | 2 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 4fa248b98031..69566695d032 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -142,7 +142,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, &desc, 1, 0);
+   qi_submit_sync(sdev->iommu, &desc, 1, 0);
 
if (sdev->dev_iotlb) {
desc.qw0 = QI_DEV_EIOTLB_PASID(svm->pasid) |
@@ -166,7 +166,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, &desc, 1, 0);
+   qi_submit_sync(sdev->iommu, &desc, 1, 0);
}
 }
 
@@ -211,7 +211,7 @@ static void intel_mm_release(struct mmu_notifier *mn, 
struct mm_struct *mm)
 */
rcu_read_lock();
list_for_each_entry_rcu(sdev, &svm->devs, list)
-   intel_pasid_tear_down_entry(svm->iommu, sdev->dev,
+   intel_pasid_tear_down_entry(sdev->iommu, sdev->dev,
svm->pasid, true);
rcu_read_unlock();
 
@@ -363,6 +363,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, 
struct device *dev,
}
sdev->dev = dev;
sdev->sid = PCI_DEVID(info->bus, info->devfn);
+   sdev->iommu = iommu;
 
/* Only count users if device has aux domains */
if (iommu_dev_feature_enabled(dev, IOMMU_DEV_FEAT_AUX))
@@ -546,6 +547,7 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
goto out;
}
sdev->dev = dev;
+   sdev->iommu = iommu;
 
ret = intel_iommu_enable_pasid(iommu, dev);
if (ret) {
@@ -575,7 +577,6 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
kfree(sdev);
goto out;
}
-   svm->iommu = iommu;
 
if (pasid_max > intel_pasid_max_id)
pasid_max = intel_pasid_max_id;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index d956987ed032..94522685a0d9 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -758,6 +758,7 @@ struct intel_svm_dev {
struct list_head list;
struct rcu_head rcu;
struct device *dev;
+   struct intel_iommu *iommu;
struct svm_dev_ops *ops;
struct iommu_sva sva;
u32 pasid;
@@ -771,7 +772,6 @@ struct intel_svm {
struct mmu_notifier notifier;
struct mm_struct *mm;
 
-   struct intel_iommu *iommu;
unsigned int flags;
u32 pasid;
int gpasid; /* In case that guest PASID is different from host PASID */
-- 
2.25.1



RE: [PATCH v2 3/3] iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

2020-12-25 Thread Liu, Yi L
Hi Baolu,

Well received, all comments accepted. Thanks.

Regards,
Yi Liu

> From: Lu Baolu 
> Sent: Wednesday, December 23, 2020 6:10 PM
> 
> Hi Yi,
> 
> On 2020/12/23 14:27, Liu Yi L wrote:
> > iommu_flush_dev_iotlb() is called to invalidate caches on a device. It only
> > loops over the devices which are fully attached to the domain. For sub-devices,
> > this is ineffective, and results in invalid caching entries left on the
> > device. Fix it by adding a loop over the subdevices as well. Also,
> > domain->has_iotlb_device needs to be updated when attaching to subdevices.
> >
> > Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain
> attach/detach")
> > Signed-off-by: Liu Yi L 
> > ---
> >   drivers/iommu/intel/iommu.c | 63 +++-
> -
> >   1 file changed, 47 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index acfe0a5b955e..e97c5ac1d7fc 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -726,6 +726,8 @@ static int domain_update_device_node(struct
> dmar_domain *domain)
> > return nid;
> >   }
> >
> > +static void domain_update_iotlb(struct dmar_domain *domain);
> > +
> >   /* Some capabilities may be different across iommus */
> >   static void domain_update_iommu_cap(struct dmar_domain *domain)
> >   {
> > @@ -739,6 +741,8 @@ static void domain_update_iommu_cap(struct
> dmar_domain *domain)
> >  */
> > if (domain->nid == NUMA_NO_NODE)
> > domain->nid = domain_update_device_node(domain);
> > +
> > +   domain_update_iotlb(domain);
> >   }
> >
> >   struct context_entry *iommu_context_addr(struct intel_iommu *iommu,
> u8 bus,
> > @@ -1459,6 +1463,18 @@ iommu_support_dev_iotlb (struct dmar_domain
> *domain, struct intel_iommu *iommu,
> > return NULL;
> >   }
> >
> > +static bool dev_iotlb_enabled(struct device_domain_info *info)
> > +{
> > +   struct pci_dev *pdev;
> > +
> > +   if (!info->dev || !dev_is_pci(info->dev))
> > +   return false;
> > +
> > +   pdev = to_pci_dev(info->dev);
> > +
> > +   return !!pdev->ats_enabled;
> > +}
> 
> I know this is just separated from the below function. But isn't "(info &&
> info->ats_enabled)" enough?
> 
> > +
> >   static void domain_update_iotlb(struct dmar_domain *domain)
> >   {
> > struct device_domain_info *info;
> > @@ -1466,17 +1482,20 @@ static void domain_update_iotlb(struct
> dmar_domain *domain)
> >
> > assert_spin_locked(&device_domain_lock);
> >
> > -   list_for_each_entry(info, &domain->devices, link) {
> > -   struct pci_dev *pdev;
> > -
> > -   if (!info->dev || !dev_is_pci(info->dev))
> > -   continue;
> > -
> > -   pdev = to_pci_dev(info->dev);
> > -   if (pdev->ats_enabled) {
> > +   list_for_each_entry(info, &domain->devices, link)
> > +   if (dev_iotlb_enabled(info)) {
> > has_iotlb_device = true;
> > break;
> > }
> > +
> > +   if (!has_iotlb_device) {
> > +   struct subdev_domain_info *sinfo;
> > +
> > +   list_for_each_entry(sinfo, &domain->subdevices, link_domain)
> > +   if (dev_iotlb_enabled(get_domain_info(sinfo->pdev))) {
> 
> Please make the code easier for reading by:
> 
>   info = get_domain_info(sinfo->pdev);
>   if (dev_iotlb_enabled(info))
>   
> 
> Best regards,
> baolu
> 
> > +   has_iotlb_device = true;
> > +   break;
> > +   }
> > }
> >
> > domain->has_iotlb_device = has_iotlb_device;
> > @@ -1557,25 +1576,37 @@ static void iommu_disable_dev_iotlb(struct
> device_domain_info *info)
> >   #endif
> >   }
> >
> > +static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
> > +   u64 addr, unsigned int mask)
> > +{
> > +   u16 sid, qdep;
> > +
> > +   if (!info || !info->ats_enabled)
> > +   return;
> > +
> > +   sid = info->bus << 8 | info->devfn;
> > +   qdep = info->ats_qdep;
> > +   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
> > +  qdep, addr, mask);
> >

[PATCH v2 1/3] iommu/vt-d: Move intel_iommu info from struct intel_svm to struct intel_svm_dev

2020-12-22 Thread Liu Yi L
Current struct intel_svm has a field to record the struct intel_iommu
pointer for a PASID bind. And struct intel_svm will be shared by all
the devices bound to the same process. The devices may be behind different
DMAR units. As the iommu driver code uses the intel_iommu pointer stored
in the intel_svm struct to do cache invalidations, it may only flush the
cache on a single DMAR unit; for the others, the cache invalidation is missed.

As intel_svm struct already has a device list, this patch just moves the
intel_iommu pointer to be a field of intel_svm_dev struct.

Fixes: 1c4f88b7f1f92 ("iommu/vt-d: Shared virtual address in scalable mode")
Cc: Lu Baolu 
Cc: Jacob Pan 
Cc: Raj Ashok 
Cc: David Woodhouse 
Reported-by: Guo Kaijie 
Reported-by: Xin Zeng 
Signed-off-by: Guo Kaijie 
Signed-off-by: Xin Zeng 
Signed-off-by: Liu Yi L 
Tested-by: Guo Kaijie 
---
 drivers/iommu/intel/svm.c   | 9 +
 include/linux/intel-iommu.h | 2 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 3242ebd0bca3..4a10c9ff368c 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -142,7 +142,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, &desc, 1, 0);
+   qi_submit_sync(sdev->iommu, &desc, 1, 0);
 
if (sdev->dev_iotlb) {
desc.qw0 = QI_DEV_EIOTLB_PASID(svm->pasid) |
@@ -166,7 +166,7 @@ static void intel_flush_svm_range_dev (struct intel_svm 
*svm, struct intel_svm_d
}
desc.qw2 = 0;
desc.qw3 = 0;
-   qi_submit_sync(svm->iommu, &desc, 1, 0);
+   qi_submit_sync(sdev->iommu, &desc, 1, 0);
}
 }
 
@@ -211,7 +211,7 @@ static void intel_mm_release(struct mmu_notifier *mn, 
struct mm_struct *mm)
 */
rcu_read_lock();
list_for_each_entry_rcu(sdev, &svm->devs, list)
-   intel_pasid_tear_down_entry(svm->iommu, sdev->dev,
+   intel_pasid_tear_down_entry(sdev->iommu, sdev->dev,
svm->pasid, true);
rcu_read_unlock();
 
@@ -363,6 +363,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, 
struct device *dev,
}
sdev->dev = dev;
sdev->sid = PCI_DEVID(info->bus, info->devfn);
+   sdev->iommu = iommu;
 
/* Only count users if device has aux domains */
if (iommu_dev_feature_enabled(dev, IOMMU_DEV_FEAT_AUX))
@@ -546,6 +547,7 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
goto out;
}
sdev->dev = dev;
+   sdev->iommu = iommu;
 
ret = intel_iommu_enable_pasid(iommu, dev);
if (ret) {
@@ -575,7 +577,6 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
kfree(sdev);
goto out;
}
-   svm->iommu = iommu;
 
if (pasid_max > intel_pasid_max_id)
pasid_max = intel_pasid_max_id;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index d956987ed032..94522685a0d9 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -758,6 +758,7 @@ struct intel_svm_dev {
struct list_head list;
struct rcu_head rcu;
struct device *dev;
+   struct intel_iommu *iommu;
struct svm_dev_ops *ops;
struct iommu_sva sva;
u32 pasid;
@@ -771,7 +772,6 @@ struct intel_svm {
struct mmu_notifier notifier;
struct mm_struct *mm;
 
-   struct intel_iommu *iommu;
unsigned int flags;
u32 pasid;
int gpasid; /* In case that guest PASID is different from host PASID */
-- 
2.25.1



[PATCH v2 3/3] iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

2020-12-22 Thread Liu Yi L
iommu_flush_dev_iotlb() is called to invalidate caches on a device. It only
loops over the devices which are fully attached to the domain. For sub-devices,
this is ineffective, and results in invalid caching entries left on the
device. Fix it by adding a loop over the subdevices as well. Also,
domain->has_iotlb_device needs to be updated when attaching to subdevices.

Fixes: 67b8e02b5e761 ("iommu/vt-d: Aux-domain specific domain attach/detach")
Signed-off-by: Liu Yi L 
---
 drivers/iommu/intel/iommu.c | 63 +++--
 1 file changed, 47 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index acfe0a5b955e..e97c5ac1d7fc 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -726,6 +726,8 @@ static int domain_update_device_node(struct dmar_domain 
*domain)
return nid;
 }
 
+static void domain_update_iotlb(struct dmar_domain *domain);
+
 /* Some capabilities may be different across iommus */
 static void domain_update_iommu_cap(struct dmar_domain *domain)
 {
@@ -739,6 +741,8 @@ static void domain_update_iommu_cap(struct dmar_domain 
*domain)
 */
if (domain->nid == NUMA_NO_NODE)
domain->nid = domain_update_device_node(domain);
+
+   domain_update_iotlb(domain);
 }
 
 struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8 bus,
@@ -1459,6 +1463,18 @@ iommu_support_dev_iotlb (struct dmar_domain *domain, 
struct intel_iommu *iommu,
return NULL;
 }
 
+static bool dev_iotlb_enabled(struct device_domain_info *info)
+{
+   struct pci_dev *pdev;
+
+   if (!info->dev || !dev_is_pci(info->dev))
+   return false;
+
+   pdev = to_pci_dev(info->dev);
+
+   return !!pdev->ats_enabled;
+}
+
 static void domain_update_iotlb(struct dmar_domain *domain)
 {
struct device_domain_info *info;
@@ -1466,17 +1482,20 @@ static void domain_update_iotlb(struct dmar_domain 
*domain)
 
assert_spin_locked(&device_domain_lock);
 
-   list_for_each_entry(info, &domain->devices, link) {
-   struct pci_dev *pdev;
-
-   if (!info->dev || !dev_is_pci(info->dev))
-   continue;
-
-   pdev = to_pci_dev(info->dev);
-   if (pdev->ats_enabled) {
+   list_for_each_entry(info, &domain->devices, link)
+   if (dev_iotlb_enabled(info)) {
has_iotlb_device = true;
break;
}
+
+   if (!has_iotlb_device) {
+   struct subdev_domain_info *sinfo;
+
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain)
+   if (dev_iotlb_enabled(get_domain_info(sinfo->pdev))) {
+   has_iotlb_device = true;
+   break;
+   }
}
 
domain->has_iotlb_device = has_iotlb_device;
@@ -1557,25 +1576,37 @@ static void iommu_disable_dev_iotlb(struct 
device_domain_info *info)
 #endif
 }
 
+static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
+   u64 addr, unsigned int mask)
+{
+   u16 sid, qdep;
+
+   if (!info || !info->ats_enabled)
+   return;
+
+   sid = info->bus << 8 | info->devfn;
+   qdep = info->ats_qdep;
+   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
+  qdep, addr, mask);
+}
+
 static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
  u64 addr, unsigned mask)
 {
-   u16 sid, qdep;
unsigned long flags;
struct device_domain_info *info;
+   struct subdev_domain_info *sinfo;
 
if (!domain->has_iotlb_device)
return;
 
spin_lock_irqsave(&device_domain_lock, flags);
-   list_for_each_entry(info, &domain->devices, link) {
-   if (!info->ats_enabled)
-   continue;
+   list_for_each_entry(info, &domain->devices, link)
+   __iommu_flush_dev_iotlb(info, addr, mask);
 
-   sid = info->bus << 8 | info->devfn;
-   qdep = info->ats_qdep;
-   qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
-   qdep, addr, mask);
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
+   __iommu_flush_dev_iotlb(get_domain_info(sinfo->pdev),
+   addr, mask);
}
spin_unlock_irqrestore(&device_domain_lock, flags);
 }
-- 
2.25.1



[PATCH v2 2/3] iommu/vt-d: Track device aux-attach with subdevice_domain_info

2020-12-22 Thread Liu Yi L
In the existing code, looping over all devices attached to a domain does
not include sub-devices attached via iommu_aux_attach_device().

This was found while working on the below patch: there is no device in
the domain->devices list, thus we are unable to get the cap and ecap of
the iommu unit. But this domain actually has a subdevice which is
attached in the aux manner, and that subdevice is not tracked by the
domain. This patch is going to fix it.

https://lore.kernel.org/kvm/1599734733-6431-17-git-send-email-yi.l@intel.com/

And this fix goes beyond the patch above: such sub-device tracking is
necessary for other cases as well, for example, flushing the device_iotlb
for a domain which has sub-devices attached in the auxiliary manner.
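The refcounted link/unlink the patch introduces can be sketched as a standalone (non-kernel) C program. subdev_domain_info and the link/unlink return-value convention follow the patch; the fixed array replacing list_head and the void* device handle are simplifying assumptions for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative mirror of the patch's tracking struct. */
struct subdev_domain_info {
    void *pdev;   /* physical device the subdevice sits on */
    int users;    /* subdevices of pdev attached to this domain */
};

#define MAX_SUBDEVS 8
struct dmar_domain {
    struct subdev_domain_info *subdevices[MAX_SUBDEVS]; /* stand-in for list_head */
};

static struct subdev_domain_info *
lookup_subdev_info(struct dmar_domain *domain, void *dev)
{
    int i;

    for (i = 0; i < MAX_SUBDEVS; i++)
        if (domain->subdevices[i] && domain->subdevices[i]->pdev == dev)
            return domain->subdevices[i];
    return NULL;
}

/* Returns the new user count; 1 means first attach (full setup needed). */
static int auxiliary_link_device(struct dmar_domain *domain, void *dev)
{
    struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
    int i;

    if (!sinfo) {
        sinfo = calloc(1, sizeof(*sinfo));
        sinfo->pdev = dev;
        for (i = 0; i < MAX_SUBDEVS; i++)
            if (!domain->subdevices[i]) { domain->subdevices[i] = sinfo; break; }
    }
    return ++sinfo->users;
}

/* Returns the remaining user count; drops the tracking on last detach. */
static int auxiliary_unlink_device(struct dmar_domain *domain, void *dev)
{
    struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
    int i, ret;

    if (!sinfo || sinfo->users <= 0)
        return -1;
    ret = --sinfo->users;
    if (!ret) {
        for (i = 0; i < MAX_SUBDEVS; i++)
            if (domain->subdevices[i] == sinfo) { domain->subdevices[i] = NULL; break; }
        free(sinfo);
    }
    return ret;
}
```

Only the first attach from a given physical device returns 1, which is what lets aux_domain_add_dev() skip the full setup for later subdevices of the same device.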

Co-developed-by: Xin Zeng 
Signed-off-by: Xin Zeng 
Signed-off-by: Liu Yi L 
---
 drivers/iommu/intel/iommu.c | 95 +++--
 include/linux/intel-iommu.h | 16 +--
 2 files changed, 82 insertions(+), 29 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index a49afa11673c..acfe0a5b955e 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1881,6 +1881,7 @@ static struct dmar_domain *alloc_domain(int flags)
domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
domain->has_iotlb_device = false;
INIT_LIST_HEAD(&domain->devices);
+   INIT_LIST_HEAD(&domain->subdevices);
 
return domain;
 }
@@ -2632,7 +2633,7 @@ static struct dmar_domain 
*dmar_insert_one_dev_info(struct intel_iommu *iommu,
info->iommu = iommu;
info->pasid_table = NULL;
info->auxd_enabled = 0;
-   INIT_LIST_HEAD(&info->auxiliary_domains);
+   INIT_LIST_HEAD(&info->subdevices);
 
if (dev && dev_is_pci(dev)) {
struct pci_dev *pdev = to_pci_dev(info->dev);
@@ -5172,33 +5173,61 @@ is_aux_domain(struct device *dev, struct iommu_domain 
*domain)
domain->type == IOMMU_DOMAIN_UNMANAGED;
 }
 
-static void auxiliary_link_device(struct dmar_domain *domain,
- struct device *dev)
+static inline struct subdev_domain_info *
+lookup_subdev_info(struct dmar_domain *domain, struct device *dev)
+{
+   struct subdev_domain_info *sinfo;
+
+   if (!list_empty(&domain->subdevices)) {
+   list_for_each_entry(sinfo, &domain->subdevices, link_domain) {
+   if (sinfo->pdev == dev)
+   return sinfo;
+   }
+   }
+
+   return NULL;
+}
+
+static int auxiliary_link_device(struct dmar_domain *domain,
+struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
 
assert_spin_locked(&device_domain_lock);
if (WARN_ON(!info))
-   return;
+   return -EINVAL;
+
+   if (!sinfo) {
+   sinfo = kzalloc(sizeof(*sinfo), GFP_ATOMIC);
+   sinfo->domain = domain;
+   sinfo->pdev = dev;
+   list_add(&sinfo->link_phys, &info->subdevices);
+   list_add(&sinfo->link_domain, &domain->subdevices);
+   }
 
-   domain->auxd_refcnt++;
-   list_add(&domain->auxd, &info->auxiliary_domains);
+   return ++sinfo->users;
 }
 
-static void auxiliary_unlink_device(struct dmar_domain *domain,
-   struct device *dev)
+static int auxiliary_unlink_device(struct dmar_domain *domain,
+  struct device *dev)
 {
struct device_domain_info *info = get_domain_info(dev);
+   struct subdev_domain_info *sinfo = lookup_subdev_info(domain, dev);
+   int ret;
 
assert_spin_locked(&device_domain_lock);
-   if (WARN_ON(!info))
-   return;
+   if (WARN_ON(!info || !sinfo || sinfo->users <= 0))
+   return -EINVAL;
 
-   list_del(&domain->auxd);
-   domain->auxd_refcnt--;
+   ret = --sinfo->users;
+   if (!ret) {
+   list_del(&sinfo->link_phys);
+   list_del(&sinfo->link_domain);
+   kfree(sinfo);
+   }
 
-   if (!domain->auxd_refcnt && domain->default_pasid > 0)
-   ioasid_free(domain->default_pasid);
+   return ret;
 }
 
 static int aux_domain_add_dev(struct dmar_domain *domain,
@@ -5227,6 +5256,19 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
}
 
spin_lock_irqsave(&device_domain_lock, flags);
+   ret = auxiliary_link_device(domain, dev);
+   if (ret <= 0)
+   goto link_failed;
+
+   /*
+* Subdevices from the same physical device can be attached to the
+* same domain. For such cases, only the first subdevice attachment
+* needs to go through the full steps in this function. So if ret >
+* 1, just goto out.
+*/
+   if (ret > 1)
+   goto out;
+
/*
   

[PATCH v2 0/3] iommu/vt-d: Misc fixes on scalable mode

2020-12-22 Thread Liu Yi L
This patchset aims to fix a bug regarding native SVM usage, and also
several bugs around subdevice tracking (devices attached in the auxiliary
manner) and ineffective device_tlb flushing.

Liu Yi L (3):
  iommu/vt-d: Move intel_iommu info from struct intel_svm to struct
intel_svm_dev
  iommu/vt-d: Track device aux-attach with subdevice_domain_info
  iommu/vt-d: Fix ineffective devTLB invalidation for subdevices

 drivers/iommu/intel/iommu.c | 158 +++-
 drivers/iommu/intel/svm.c   |   9 +-
 include/linux/intel-iommu.h |  18 ++--
 3 files changed, 135 insertions(+), 50 deletions(-)

-- 
2.25.1



RE: [PATCH v2 1/1] vfio/type1: Add vfio_group_domain()

2020-11-25 Thread Liu, Yi L
On Thurs, Nov 26, 2020, at 9:27 AM, Lu Baolu wrote:
> Add the API for getting the domain from a vfio group. This could be used
> by the physical device drivers which rely on the vfio/mdev framework for
> mediated device user level access. The typical use case like below:
> 
>   unsigned int pasid;
>   struct vfio_group *vfio_group;
>   struct iommu_domain *iommu_domain;
>   struct device *dev = mdev_dev(mdev);
>   struct device *iommu_device = mdev_get_iommu_device(dev);
> 
>   if (!iommu_device ||
>   !iommu_dev_feature_enabled(iommu_device, IOMMU_DEV_FEAT_AUX))
>   return -EINVAL;
> 
>   vfio_group = vfio_group_get_external_user_from_dev(dev);(dev);

duplicate (dev); other parts looks good to me. perhaps, you can also
describe that the release function of a sub-device fd should also call
vfio_group_put_external_user() to release its reference on the vfio_group.
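The get/put pairing suggested here can be sketched as a standalone (non-kernel) C program. vfio_group_get_external_user_from_dev() and vfio_group_put_external_user() are real VFIO symbols, but the stub implementations, subdev_fd_ctx, and the open/release helpers below are hypothetical stand-ins for illustration:

```c
#include <assert.h>
#include <stddef.h>

struct vfio_group { int refs; };

static struct vfio_group the_group;  /* stand-in for the group backing dev */

/* Stub: real VFIO looks the group up from the device and takes a ref. */
static struct vfio_group *vfio_group_get_external_user_from_dev(void *dev)
{
    (void)dev;
    the_group.refs++;
    return &the_group;
}

/* Stub: real VFIO drops the external-user reference. */
static void vfio_group_put_external_user(struct vfio_group *group)
{
    group->refs--;
}

/* Hypothetical private data of a sub-device fd. */
struct subdev_fd_ctx { struct vfio_group *vfio_group; };

/* Open path: take the reference and stash it in the fd's private data. */
static void subdev_fd_open(struct subdev_fd_ctx *ctx, void *dev)
{
    ctx->vfio_group = vfio_group_get_external_user_from_dev(dev);
}

/* Release path: the reference taken at open time must be dropped here,
 * which is the pairing the review comment asks to be documented. */
static void subdev_fd_release(struct subdev_fd_ctx *ctx)
{
    vfio_group_put_external_user(ctx->vfio_group);
    ctx->vfio_group = NULL;
}
```

Without the put in the release path, the vfio_group reference (and the container pinned behind it) would leak for the lifetime of the process.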

Regards,
Yi Liu 

>   if (IS_ERR_OR_NULL(vfio_group))
>   return -EFAULT;
> 
>   iommu_domain = vfio_group_domain(vfio_group);
>   if (IS_ERR_OR_NULL(iommu_domain)) {
>   vfio_group_put_external_user(vfio_group);
>   return -EFAULT;
>   }
> 
>   pasid = iommu_aux_get_pasid(iommu_domain, iommu_device);
>   if (pasid < 0) {
>   vfio_group_put_external_user(vfio_group);
>   return -EFAULT;
>   }
> 
>   /* Program device context with pasid value. */
>   ...
> 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/vfio/vfio.c | 18 ++
>  drivers/vfio/vfio_iommu_type1.c | 23 +++
>  include/linux/vfio.h|  3 +++
>  3 files changed, 44 insertions(+)
> 
> Change log:
>  - v1: 
> https://lore.kernel.org/linux-iommu/20201112022407.2063896-1-baolu...@linux.intel.com/
>  - Changed according to comments @ 
> https://lore.kernel.org/linux-iommu/20201116125631.2d043...@w520.home/
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 2151bc7f87ab..62c652111c88 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -2331,6 +2331,24 @@ int vfio_unregister_notifier(struct device *dev,
> enum vfio_notify_type type,
>  }
>  EXPORT_SYMBOL(vfio_unregister_notifier);
> 
> +struct iommu_domain *vfio_group_domain(struct vfio_group *group)
> +{
> + struct vfio_container *container;
> + struct vfio_iommu_driver *driver;
> +
> + if (!group)
> + return ERR_PTR(-EINVAL);
> +
> + container = group->container;
> + driver = container->iommu_driver;
> + if (likely(driver && driver->ops->group_domain))
> + return driver->ops->group_domain(container->iommu_data,
> +  group->iommu_group);
> + else
> + return ERR_PTR(-ENOTTY);
> +}
> +EXPORT_SYMBOL(vfio_group_domain);
> +
>  /**
>   * Module/class support
>   */
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 67e827638995..783f18f21b95 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2980,6 +2980,28 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data,
> dma_addr_t user_iova,
>   return ret;
>  }
> 
> +static void *vfio_iommu_type1_group_domain(void *iommu_data,
> +struct iommu_group *iommu_group)
> +{
> + struct vfio_iommu *iommu = iommu_data;
> + struct iommu_domain *domain = NULL;
> + struct vfio_domain *d;
> +
> + if (!iommu || !iommu_group)
> + return ERR_PTR(-EINVAL);
> +
> + mutex_lock(&iommu->lock);
> + list_for_each_entry(d, &iommu->domain_list, next) {
> + if (find_iommu_group(d, iommu_group)) {
> + domain = d->domain;
> + break;
> + }
> + }
> + mutex_unlock(&iommu->lock);
> +
> + return domain;
> +}
> +
>  static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>   .name   = "vfio-iommu-type1",
>   .owner  = THIS_MODULE,
> @@ -2993,6 +3015,7 @@ static const struct vfio_iommu_driver_ops 
> vfio_iommu_driver_ops_type1 = {
>   .register_notifier  = vfio_iommu_type1_register_notifier,
>   .unregister_notifier= vfio_iommu_type1_unregister_notifier,
>   .dma_rw = vfio_iommu_type1_dma_rw,
> + .group_domain   = vfio_iommu_type1_group_domain,
>  };
> 
>  static int __init vfio_iommu_type1_init(void)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 38d3c6a8dc7e..a0613a6f21cc 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -90,6 +90,7 @@ struct vfio_iommu_driver_ops {
>  struct notifier_block *nb);
>   int (*dma_rw)(void *iommu_data, dma_addr_t user_iova,
> void *data, size_t count, bool write);
> + void*(*group_domain)(void *iommu_data, struct 

RE: [PATCH v6 07/15] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)

2020-08-20 Thread Liu, Yi L
Hi Alex,

> From: Alex Williamson 
> Sent: Friday, August 21, 2020 9:49 AM
> 
> On Fri, 21 Aug 2020 00:37:19 +0000
> "Liu, Yi L"  wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson 
> > > Sent: Friday, August 21, 2020 4:51 AM
> > >
> > > On Mon, 27 Jul 2020 23:27:36 -0700
> > > Liu Yi L  wrote:
> > >
> > > > This patch allows userspace to request PASID allocation/free, e.g.
> > > > when serving the request from the guest.
> > > >
> > > > PASIDs that are not freed by userspace are automatically freed when
> > > > the IOASID set is destroyed when process exits.
> > > >
> > > > Cc: Kevin Tian 
> > > > CC: Jacob Pan 
> > > > Cc: Alex Williamson 
> > > > Cc: Eric Auger 
> > > > Cc: Jean-Philippe Brucker 
> > > > Cc: Joerg Roedel 
> > > > Cc: Lu Baolu 
> > > > Signed-off-by: Liu Yi L 
> > > > Signed-off-by: Yi Sun 
> > > > Signed-off-by: Jacob Pan 
> > > > ---
> > > > v5 -> v6:
> > > > *) address comments from Eric against v5. remove the alloc/free helper.
> > > >
> > > > v4 -> v5:
> > > > *) address comments from Eric Auger.
> > > > *) the comments for the PASID_FREE request is addressed in patch 5/15 of
> > > >this series.
> > > >
> > > > v3 -> v4:
> > > > *) address comments from v3, except the below comment against the range
> > > >of PASID_FREE request. needs more help on it.
> > > > "> +if (req.range.min > req.range.max)
> > > >
> > > >  Is it exploitable that a user can spin the kernel for a long time 
> > > > in
> > > >  the case of a free by calling this with [0, MAX_UINT] regardless of
> > > >  their actual allocations?"
> > > >
> > > > https://lore.kernel.org/linux-iommu/20200702151832.048b4...@x1.home/
> > > >
> > > > v1 -> v2:
> > > > *) move the vfio_mm related code to be a seprate module
> > > > *) use a single structure for alloc/free, could support a range of
> > > > PASIDs
> > > > *) fetch vfio_mm at group_attach time instead of at iommu driver open
> > > > time
> > > > ---
> > > >  drivers/vfio/Kconfig|  1 +
> > > >  drivers/vfio/vfio_iommu_type1.c | 69
> > > +
> > > >  drivers/vfio/vfio_pasid.c   | 10 ++
> > > >  include/linux/vfio.h|  6 
> > > >  include/uapi/linux/vfio.h   | 37 ++
> > > >  5 files changed, 123 insertions(+)
> > > >
> > > > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index
> > > > 3d8a108..95d90c6 100644
> > > > --- a/drivers/vfio/Kconfig
> > > > +++ b/drivers/vfio/Kconfig
> > > > @@ -2,6 +2,7 @@
> > > >  config VFIO_IOMMU_TYPE1
> > > > tristate
> > > > depends on VFIO
> > > > +   select VFIO_PASID if (X86)
> > > > default n
> > > >
> > > >  config VFIO_IOMMU_SPAPR_TCE
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > > b/drivers/vfio/vfio_iommu_type1.c index 18ff0c3..ea89c7c 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -76,6 +76,7 @@ struct vfio_iommu {
> > > > booldirty_page_tracking;
> > > > boolpinned_page_dirty_scope;
> > > > struct iommu_nesting_info   *nesting_info;
> > > > +   struct vfio_mm  *vmm;
> > > >  };
> > > >
> > > >  struct vfio_domain {
> > > > @@ -1937,6 +1938,11 @@ static void vfio_iommu_iova_insert_copy(struct
> > > > vfio_iommu *iommu,
> > > >
> > > >  static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
> > > > {
> > > > +   if (iommu->vmm) {
> > > > +   vfio_mm_put(iommu->vmm);
> > > > +   iommu->vmm = NULL;
> > > > +   }
> > > > +
> > > > kfree(iommu->nesting_info);
> > > > iommu->nesting_info = NULL;
> > > >  }
> > > > @@ -2071,6 +2077,26 @@ sta

RE: [PATCH v6 12/15] vfio/type1: Add vSVA support for IOMMU-backed mdevs

2020-08-20 Thread Liu, Yi L
Hi Alex,

> From: Alex Williamson 
> Sent: Friday, August 21, 2020 5:49 AM
> 
> On Mon, 27 Jul 2020 23:27:41 -0700
> Liu Yi L  wrote:
> 
> > In recent years, the mediated device pass-through framework (e.g. vfio-mdev)
> > has been used to achieve flexible device sharing across domains (e.g. VMs).
> > Also there are hardware-assisted mediated pass-through solutions from
> > platform vendors, e.g. Intel VT-d scalable mode which supports the Intel
> > Scalable I/O Virtualization technology. Such mdevs are called
> > IOMMU-backed mdevs as there is IOMMU-enforced DMA isolation for them.
> > In kernel, IOMMU-backed mdevs are exposed to the IOMMU layer by the aux-domain
> 
> Or a physical IOMMU backing device.

got it. :-)

> > concept, which means mdevs are protected by an iommu domain which is
> > auxiliary to the domain that the kernel driver primarily uses for DMA
> > API. Details can be found in the KVM presentation as below:
> >
> > https://events19.linuxfoundation.org/wp-content/uploads/2017/12/\
> > Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf
> 
> I think letting the line exceed 80 columns is preferable so that it's 
> clickable.  Thanks,

yeah, it's clickable now. will do it. :-)

Thanks,
Yi Liu

> Alex
> 
> > This patch extends NESTING_IOMMU ops to IOMMU-backed mdev devices. The
> > main requirement is to use the auxiliary domain associated with mdev.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > CC: Jun Tian 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Reviewed-by: Eric Auger 
> > Signed-off-by: Liu Yi L 
> > ---
> > v5 -> v6:
> > *) add review-by from Eric Auger.
> >
> > v1 -> v2:
> > *) check the iommu_device to ensure the handling mdev is IOMMU-backed
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 40
> > 
> >  1 file changed, 36 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index bf95a0f..9d8f252 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2379,20 +2379,41 @@ static int vfio_iommu_resv_refresh(struct
> vfio_iommu *iommu,
> > return ret;
> >  }
> >
> > +static struct device *vfio_get_iommu_device(struct vfio_group *group,
> > +   struct device *dev)
> > +{
> > +   if (group->mdev_group)
> > +   return vfio_mdev_get_iommu_device(dev);
> > +   else
> > +   return dev;
> > +}
> > +
> >  static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)  {
> > struct domain_capsule *dc = (struct domain_capsule *)data;
> > unsigned long arg = *(unsigned long *)dc->data;
> > +   struct device *iommu_device;
> > +
> > +   iommu_device = vfio_get_iommu_device(dc->group, dev);
> > +   if (!iommu_device)
> > +   return -EINVAL;
> >
> > -   return iommu_uapi_sva_bind_gpasid(dc->domain, dev, (void __user *)arg);
> > +   return iommu_uapi_sva_bind_gpasid(dc->domain, iommu_device,
> > + (void __user *)arg);
> >  }
> >
> >  static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
> > {
> > struct domain_capsule *dc = (struct domain_capsule *)data;
> > unsigned long arg = *(unsigned long *)dc->data;
> > +   struct device *iommu_device;
> >
> > -   iommu_uapi_sva_unbind_gpasid(dc->domain, dev, (void __user *)arg);
> > +   iommu_device = vfio_get_iommu_device(dc->group, dev);
> > +   if (!iommu_device)
> > +   return -EINVAL;
> > +
> > +   iommu_uapi_sva_unbind_gpasid(dc->domain, iommu_device,
> > +(void __user *)arg);
> > return 0;
> >  }
> >
> > @@ -2401,8 +2422,13 @@ static int __vfio_dev_unbind_gpasid_fn(struct device
> *dev, void *data)
> > struct domain_capsule *dc = (struct domain_capsule *)data;
> > struct iommu_gpasid_bind_data *unbind_data =
> > (struct iommu_gpasid_bind_data *)dc->data;
> > +   struct device *iommu_device;
> > +
> > +   iommu_device = vfio_get_iommu_device(dc->group, dev);
> > +   if (!iommu_device)
> > +   return -EINVAL;
> >
> > -   iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data);
> > +   iommu_sva_unbind_gpasid(dc->domain, iommu_device, unbind_data);
> > return 0;
> >  }
> >
> > @@ -3060,8 +3086,14 @@ static int vfio_dev_cache_invalidate_fn(struct
> > device *dev, void *data)  {
> > struct domain_capsule *dc = (struct domain_capsule *)data;
> > unsigned long arg = *(unsigned long *)dc->data;
> > +   struct device *iommu_device;
> > +
> > +   iommu_device = vfio_get_iommu_device(dc->group, dev);
> > +   if (!iommu_device)
> > +   return -EINVAL;
> >
> > -   iommu_uapi_cache_invalidate(dc->domain, dev, (void __user *)arg);
> > +   iommu_uapi_cache_invalidate(dc->domain, iommu_device,
> > +   (void __user *)arg);
> > return 0;
> >  }
> >



RE: [PATCH v6 04/15] vfio/type1: Report iommu nesting info to userspace

2020-08-20 Thread Liu, Yi L
Hi Alex,

> From: Alex Williamson 
> Sent: Friday, August 21, 2020 3:52 AM
> 
> On Mon, 27 Jul 2020 23:27:33 -0700
> Liu Yi L  wrote:
> 
> > This patch exports iommu nesting capability info to user space through
> > VFIO. Userspace is expected to check this info for supported uAPIs (e.g.
> > PASID alloc/free, bind page table, and cache invalidation) and the vendor
> > specific format information for first level/stage page table that will be
> > bound to.
> >
> > The nesting info is available only after container set to be NESTED type.
> > Current implementation imposes one limitation - one nesting container
> > should include at most one iommu group. The philosophy of vfio container
> > is having all groups/devices within the container share the same IOMMU
> > context. When vSVA is enabled, one IOMMU context could include one 2nd-
> > level address space and multiple 1st-level address spaces. While the
> > 2nd-level address space is reasonably sharable by multiple groups, blindly
> > sharing 1st-level address spaces across all groups within the container
> > might instead break the guest expectation. In the future sub/super container
> > concept might be introduced to allow partial address space sharing within
> > an IOMMU context. But for now let's go with this restriction by requiring
> > singleton container for using nesting iommu features. Below link has the
> > related discussion about this decision.
> >
> > https://lore.kernel.org/kvm/20200515115924.37e69...@w520.home/
> >
> > This patch also changes the NESTING type container behaviour. Something
> > that would have succeeded before will now fail: Before this series, if
> > user asked for a VFIO_IOMMU_TYPE1_NESTING, it would have succeeded even
> > if the SMMU didn't support stage-2, as the driver would have silently
> > fallen back on stage-1 mappings (which work exactly the same as stage-2
> > only since there was no nesting supported). After the series, we do check
> > for DOMAIN_ATTR_NESTING so if user asks for VFIO_IOMMU_TYPE1_NESTING
> and
> > the SMMU doesn't support stage-2, the ioctl fails. But it should be a good
> > fix and completely harmless. Detail can be found in below link as well.
> >
> > https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > ---
> > v5 -> v6:
> > *) address comments against v5 from Eric Auger.
> > *) don't report nesting cap to userspace if the nesting_info->format is
> >invalid.
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) return struct iommu_nesting_info for
> VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
> >cap is much "cheap", if needs extension in future, just define another 
> > cap.
> >https://lore.kernel.org/kvm/20200708132947.5b7ee...@x1.home/
> >
> > v3 -> v4:
> > *) address comments against v3.
> >
> > v1 -> v2:
> > *) added in v2
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 106
> +++-
> >  include/uapi/linux/vfio.h   |  19 +++
> >  2 files changed, 113 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > b/drivers/vfio/vfio_iommu_type1.c
> > index 3bd70ff..18ff0c3 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
> >  "Maximum number of user DMA mappings per container (65535).");
> >
> >  struct vfio_iommu {
> > -   struct list_headdomain_list;
> > -   struct list_headiova_list;
> > -   struct vfio_domain  *external_domain; /* domain for external user */
> > -   struct mutexlock;
> > -   struct rb_root  dma_list;
> > -   struct blocking_notifier_head notifier;
> > -   unsigned intdma_avail;
> > -   uint64_tpgsize_bitmap;
> > -   boolv2;
> > -   boolnesting;
> > -   booldirty_page_tracking;
> > -   boolpinned_page_dirty_scope;
> > +   struct list_headdomain_list;
> > +   struct list_headiova_list;
> > +   /* domain for external user */
> > +   struct vfio_domain  *external_domain;
> > +   struct
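
The changelog above restricts a NESTING-type container to at most one iommu group, since stage-1 address spaces cannot be blindly shared across groups. A minimal model of that attach-time policy check (toy types and names, not the series' actual code):

```c
#include <assert.h>
#include <stdbool.h>
#include <errno.h>

/* Toy container state: nesting flag plus a group count. */
struct toy_iommu {
    bool nesting;
    int nr_groups;
};

/* Model of the attach path: a nesting container accepts at most one
 * group; non-nesting containers keep the usual many-groups behavior. */
static int toy_attach_group(struct toy_iommu *iommu)
{
    if (iommu->nesting && iommu->nr_groups >= 1)
        return -EINVAL;
    iommu->nr_groups++;
    return 0;
}
```
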

RE: [PATCH v6 07/15] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)

2020-08-20 Thread Liu, Yi L
Hi Alex,

> From: Alex Williamson 
> Sent: Friday, August 21, 2020 4:51 AM
> 
> On Mon, 27 Jul 2020 23:27:36 -0700
> Liu Yi L  wrote:
> 
> > This patch allows userspace to request PASID allocation/free, e.g.
> > when serving the request from the guest.
> >
> > PASIDs that are not freed by userspace are automatically freed when
> > the IOASID set is destroyed when process exits.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Yi Sun 
> > Signed-off-by: Jacob Pan 
> > ---
> > v5 -> v6:
> > *) address comments from Eric against v5. remove the alloc/free helper.
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) the comments for the PASID_FREE request is addressed in patch 5/15 of
> >this series.
> >
> > v3 -> v4:
> > *) address comments from v3, except the below comment against the range
> >of PASID_FREE request. needs more help on it.
> > "> +if (req.range.min > req.range.max)
> >
> >  Is it exploitable that a user can spin the kernel for a long time in
> >  the case of a free by calling this with [0, MAX_UINT] regardless of
> >  their actual allocations?"
> >
> > https://lore.kernel.org/linux-iommu/20200702151832.048b4...@x1.home/
> >
> > v1 -> v2:
> > *) move the vfio_mm related code to be a seprate module
> > *) use a single structure for alloc/free, could support a range of
> > PASIDs
> > *) fetch vfio_mm at group_attach time instead of at iommu driver open
> > time
> > ---
> >  drivers/vfio/Kconfig|  1 +
> >  drivers/vfio/vfio_iommu_type1.c | 69
> +
> >  drivers/vfio/vfio_pasid.c   | 10 ++
> >  include/linux/vfio.h|  6 
> >  include/uapi/linux/vfio.h   | 37 ++
> >  5 files changed, 123 insertions(+)
> >
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index
> > 3d8a108..95d90c6 100644
> > --- a/drivers/vfio/Kconfig
> > +++ b/drivers/vfio/Kconfig
> > @@ -2,6 +2,7 @@
> >  config VFIO_IOMMU_TYPE1
> > tristate
> > depends on VFIO
> > +   select VFIO_PASID if (X86)
> > default n
> >
> >  config VFIO_IOMMU_SPAPR_TCE
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index 18ff0c3..ea89c7c 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -76,6 +76,7 @@ struct vfio_iommu {
> > booldirty_page_tracking;
> > boolpinned_page_dirty_scope;
> > struct iommu_nesting_info   *nesting_info;
> > +   struct vfio_mm  *vmm;
> >  };
> >
> >  struct vfio_domain {
> > @@ -1937,6 +1938,11 @@ static void vfio_iommu_iova_insert_copy(struct
> > vfio_iommu *iommu,
> >
> >  static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
> > {
> > +   if (iommu->vmm) {
> > +   vfio_mm_put(iommu->vmm);
> > +   iommu->vmm = NULL;
> > +   }
> > +
> > kfree(iommu->nesting_info);
> > iommu->nesting_info = NULL;
> >  }
> > @@ -2071,6 +2077,26 @@ static int vfio_iommu_type1_attach_group(void
> *iommu_data,
> > iommu->nesting_info);
> > if (ret)
> > goto out_detach;
> > +
> > +   if (iommu->nesting_info->features &
> > +   IOMMU_NESTING_FEAT_SYSWIDE_PASID)
> {
> > +   struct vfio_mm *vmm;
> > +   int sid;
> > +
> > +   vmm = vfio_mm_get_from_task(current);
> > +   if (IS_ERR(vmm)) {
> > +   ret = PTR_ERR(vmm);
> > +   goto out_detach;
> > +   }
> > +   iommu->vmm = vmm;
> > +
> > +   sid = vfio_mm_ioasid_sid(vmm);
> > +   ret = iommu_domain_set_attr(domain->domain,
> > +   DOMAIN_ATTR_IOASID_SID,
> > +   );
> > +   if (ret)
> > +   goto out_detach;
> > +  
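
On the open question quoted above (whether a free over [0, MAX_UINT] can spin the kernel): one possible mitigation is to iterate only the tracked allocations rather than the user-supplied range. This is a hypothetical sketch of that approach, not the series' implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <errno.h>

#define TOY_MAX_PASID 64  /* small fixed set for the model */

struct toy_pasid_set {
    bool allocated[TOY_MAX_PASID];
};

/* Model of a range free whose cost is bounded by the tracked set, not
 * by the requested [min, max] span, so [0, UINT32_MAX] cannot make the
 * loop unbounded. */
static int toy_pasid_free_range(struct toy_pasid_set *s,
                                uint32_t min, uint32_t max)
{
    if (min > max)
        return -EINVAL;
    for (uint32_t p = 0; p < TOY_MAX_PASID; p++)
        if (s->allocated[p] && p >= min && p <= max)
            s->allocated[p] = false;
    return 0;
}
```
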

RE: [PATCH v6 08/15] iommu: Pass domain to sva_unbind_gpasid()

2020-08-20 Thread Liu, Yi L
Hi Alex,

> From: Alex Williamson 
> Sent: Friday, August 21, 2020 5:06 AM
> 
> On Mon, 27 Jul 2020 23:27:37 -0700
> Liu Yi L  wrote:
> 
> > From: Yi Sun 
> >
> > Current interface is good enough for SVA virtualization on an assigned
> > physical PCI device, but when it comes to mediated devices, a physical
> > device may attached with multiple aux-domains. Also, for guest unbind,
> 
> s/may/may be/

got it.

> 
> > the PASID to be unbind should be allocated to the VM. This check
> > requires to know the ioasid_set which is associated with the domain.
> >
> > So this interface needs to pass in domain info. Then the iommu driver
> > is able to know which domain will be used for the 2nd stage
> > translation of the nesting mode and also be able to do PASID ownership
> > check. This patch passes @domain per the above reason. Also, the
> > prototype of  is changed frnt" to "u32" as the below link.
> 
> s/frnt"/from an "int"/

got it.

> > https://lore.kernel.org/kvm/27ac7880-bdd3-2891-139e-b4a7cd18420b@redha
> > t.com/
> 
> This is really confusing, the link is to Eric's comment asking that the 
> conversion from
> (at the time) int to ioasid_t be included in the commit log.  The text here 
> implies that
> it's pointing to some sort of justification for the change, which it isn't.  
> It just notes
> that it happened, not why it happened, with a mostly irrelevant link.

really sorry, that was my mistake. it should be the below link.

[PATCH v6 01/12] iommu: Change type of pasid to u32
https://lore.kernel.org/linux-iommu/1594684087-61184-2-git-send-email-fenghua...@intel.com/

> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Reviewed-by: Eric Auger 
> > Signed-off-by: Yi Sun 
> > Signed-off-by: Liu Yi L 
> > ---
> > v5 -> v6:
> > *) use "u32" prototype for @pasid.
> > *) add review-by from Eric Auger.
> 
> I'd probably hold off on adding Eric's R-b given the additional change in 
> this version
> FWIW.  Thanks,

ok, will hold off on it. :-)

Regards,
Yi Liu

> Alex
> 
> > v2 -> v3:
> > *) pass in domain info only
> > *) use u32 for pasid instead of int type
> >
> > v1 -> v2:
> > *) added in v2.
> > ---
> >  drivers/iommu/intel/svm.c   | 3 ++-
> >  drivers/iommu/iommu.c   | 2 +-
> >  include/linux/intel-iommu.h | 3 ++-
> >  include/linux/iommu.h   | 3 ++-
> >  4 files changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> > index c27d16a..c85b8d5 100644
> > --- a/drivers/iommu/intel/svm.c
> > +++ b/drivers/iommu/intel/svm.c
> > @@ -436,7 +436,8 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain,
> struct device *dev,
> > return ret;
> >  }
> >
> > -int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> > +int intel_svm_unbind_gpasid(struct iommu_domain *domain,
> > +   struct device *dev, u32 pasid)
> >  {
> > struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > struct intel_svm_dev *sdev;
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index
> > 1ce2a61..bee79d7 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -2145,7 +2145,7 @@ int iommu_sva_unbind_gpasid(struct iommu_domain
> *domain, struct device *dev,
> > if (unlikely(!domain->ops->sva_unbind_gpasid))
> > return -ENODEV;
> >
> > -   return domain->ops->sva_unbind_gpasid(dev, data->hpasid);
> > +   return domain->ops->sva_unbind_gpasid(domain, dev, data->hpasid);
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_sva_unbind_gpasid);
> >
> > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> > index 0d0ab32..f98146b 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -738,7 +738,8 @@ extern int intel_svm_enable_prq(struct intel_iommu
> > *iommu);  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> > int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
> >   struct iommu_gpasid_bind_data *data); -int
> > intel_svm_unbind_gpasid(struct device *dev, int pasid);
> > +int intel_svm_unbind_gpasid(struct iommu_domain *domain,
> > +   struct device *dev, u32 pasid);
> >  struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_str

RE: [PATCH v6 15/15] iommu/vt-d: Support reporting nesting capability info

2020-08-17 Thread Liu, Yi L
Eric,

> From: Auger Eric 
> Sent: Monday, August 17, 2020 3:43 PM
> 
> On 8/17/20 9:05 AM, Liu, Yi L wrote:
> > Hi Eric,
> >
> >> Auger Eric 
> >> Sent: Sunday, August 16, 2020 8:01 PM
> >>
> >> Hi Yi,
> >>
> >> On 7/28/20 8:27 AM, Liu Yi L wrote:
> >>> This patch reports nesting info, and only supports the case where all
> >>> the physical iomms have the same CAP/ECAP MASKS.
> >> s/iomms/iommus
> >
> > yep.
> >
> >>>
> >>> Cc: Kevin Tian 
> >>> CC: Jacob Pan 
> >>> Cc: Alex Williamson 
> >>> Cc: Eric Auger 
> >>> Cc: Jean-Philippe Brucker 
> >>> Cc: Joerg Roedel 
> >>> Cc: Lu Baolu 
> >>> Signed-off-by: Liu Yi L 
> >>> Signed-off-by: Jacob Pan 
> >>> ---
> >>> v2 -> v3:
> >>> *) remove cap/ecap_mask in iommu_nesting_info.
> >>> ---
> >>>  drivers/iommu/intel/iommu.c | 81
> >> +++--
> >>>  include/linux/intel-iommu.h | 16 +
> >>>  2 files changed, 95 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> >>> index 88f4647..0835804 100644
> >>> --- a/drivers/iommu/intel/iommu.c
> >>> +++ b/drivers/iommu/intel/iommu.c
> >>> @@ -5660,12 +5660,16 @@ static inline bool iommu_pasid_support(void)
> >>>  static inline bool nested_mode_support(void)
> >>>  {
> >>>   struct dmar_drhd_unit *drhd;
> >>> - struct intel_iommu *iommu;
> >>> + struct intel_iommu *iommu, *prev = NULL;
> >>>   bool ret = true;
> >>>
> >>>   rcu_read_lock();
> >>>   for_each_active_iommu(iommu, drhd) {
> >>> - if (!sm_supported(iommu) || !ecap_nest(iommu->ecap)) {
> >>> + if (!prev)
> >>> + prev = iommu;
> >>> + if (!sm_supported(iommu) || !ecap_nest(iommu->ecap) ||
> >>> + (VTD_CAP_MASK & (iommu->cap ^ prev->cap)) ||
> >>> + (VTD_ECAP_MASK & (iommu->ecap ^ prev->ecap))) {
> >>>   ret = false;
> >>>   break;
> >> So this changes the behavior of DOMAIN_ATTR_NESTING. Shouldn't it have a
> >> Fixes tag as well? And maybe add the capability getter in a separate patch?
> >
> > yes, this changed the behavior. so it would be better to be a separate patch
> > and upstream along? how about your idea? @Lu, Baolu :-)
> >
> >>>   }
> >>> @@ -6081,6 +6085,78 @@ intel_iommu_domain_set_attr(struct
> iommu_domain
> >> *domain,
> >>>   return ret;
> >>>  }
> >>>
> >>> +static int intel_iommu_get_nesting_info(struct iommu_domain *domain,
> >>> + struct iommu_nesting_info *info)
> >>> +{
> >>> + struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> >>> + u64 cap = VTD_CAP_MASK, ecap = VTD_ECAP_MASK;
> >>> + struct device_domain_info *domain_info;
> >>> + struct iommu_nesting_info_vtd vtd;
> >>> + unsigned long flags;
> >>> + unsigned int size;
> >>> +
> >
> > perhaps better to acquire the lock here. [1]
> >
> >>> + if (domain->type != IOMMU_DOMAIN_UNMANAGED ||
> >>> + !(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE))
> >>> + return -ENODEV;
> >>> +
> >>> + if (!info)
> >>> + return -EINVAL;
> >>> +
> >>> + size = sizeof(struct iommu_nesting_info) +
> >>> + sizeof(struct iommu_nesting_info_vtd);
> >>> + /*
> >>> +  * if provided buffer size is smaller than expected, should
> >>> +  * return 0 and also the expected buffer size to caller.
> >>> +  */
> >>> + if (info->argsz < size) {
> >>> + info->argsz = size;
> >>> + return 0;
> >>> + }
> >>> +
> >>> + spin_lock_irqsave(_domain_lock, flags);
> >>> + /*
> >>> +  * arbitrary select the first domain_info as all nesting
> >>> +  * related capabilities should be consistent across iommu
> >>> +  * units.
> >>> +  */
> >>> + domain_info = list_first_entry(_domain->devices,
> >>> +struc
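
The nested_mode_support() hunk above requires every active IOMMU unit to agree on the nesting-relevant CAP/ECAP bits; the XOR against the first unit picks out any differing masked bit. A standalone model of that consistency check (mask values here are made up, unlike the real VTD_CAP_MASK/VTD_ECAP_MASK):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical masks selecting only the nesting-relevant bits. */
#define TOY_CAP_MASK  0x00ffULL
#define TOY_ECAP_MASK 0x0f0fULL

struct toy_iommu_unit { uint64_t cap, ecap; };

/* Model of nested_mode_support(): all units must match the first one
 * on the masked CAP/ECAP bits; bits outside the masks may differ. */
static bool toy_nested_mode_support(const struct toy_iommu_unit *u, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if ((TOY_CAP_MASK & (u[i].cap ^ u[0].cap)) ||
            (TOY_ECAP_MASK & (u[i].ecap ^ u[0].ecap)))
            return false;
    }
    return n > 0;
}
```
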

RE: [PATCH v6 14/15] vfio: Document dual stage control

2020-08-17 Thread Liu, Yi L
Hi Eric,

> From: Eric Auger 
> Sent: Monday, August 17, 2020 3:41 PM
> 
> Hi Yi,
> 
> On 8/17/20 9:00 AM, Liu, Yi L wrote:
> > Hi Eric,
> >
> >> From: Auger Eric 
> >> Sent: Sunday, August 16, 2020 7:52 PM
> >>
> >> Hi Yi,
> >>
> >> On 7/28/20 8:27 AM, Liu Yi L wrote:
> >>> From: Eric Auger 
> >>>
> >>> The VFIO API was enhanced to support nested stage control: a bunch of
> >>> new ioctls and a usage guideline.
> >>>
> >>> Let's document the process to follow to set up nested mode.
> >>>
> >>> Cc: Kevin Tian 
> >>> CC: Jacob Pan 
> >>> Cc: Alex Williamson 
> >>> Cc: Eric Auger 
> >>> Cc: Jean-Philippe Brucker 
> >>> Cc: Joerg Roedel 
> >>> Cc: Lu Baolu 
> >>> Reviewed-by: Stefan Hajnoczi 
> >>> Signed-off-by: Eric Auger 
> >>> Signed-off-by: Liu Yi L 
> >>> ---
> >>> v5 -> v6:
> >>> *) tweak per Eric's comments.
> >>>
> >>> v3 -> v4:
> >>> *) add review-by from Stefan Hajnoczi
> >>>
> >>> v2 -> v3:
> >>> *) address comments from Stefan Hajnoczi
> >>>
> >>> v1 -> v2:
> >>> *) new in v2, compared with Eric's original version, pasid table bind
> >>>and fault reporting is removed as this series doesn't cover them.
> >>>Original version from Eric.
> >>>https://lkml.org/lkml/2020/3/20/700
> >>> ---
> >>>  Documentation/driver-api/vfio.rst | 75
> >> +++
> >>>  1 file changed, 75 insertions(+)
> >>>
> >>> diff --git a/Documentation/driver-api/vfio.rst
> >>> b/Documentation/driver-api/vfio.rst
> >>> index f1a4d3c..c0d43f0 100644
> >>> --- a/Documentation/driver-api/vfio.rst
> >>> +++ b/Documentation/driver-api/vfio.rst
> >>> @@ -239,6 +239,81 @@ group and can access them as follows::
> >>>   /* Gratuitous device reset and go... */
> >>>   ioctl(device, VFIO_DEVICE_RESET);
> >>>
> >>> +IOMMU Dual Stage Control
> >>> +
> >>> +
> >>> +Some IOMMUs support 2 stages/levels of translation. Stage
> >>> +corresponds to the ARM terminology while level corresponds to Intel's
> terminology.
> >>> +In the following text we use either without distinction.
> >>> +
> >>> +This is useful when the guest is exposed with a virtual IOMMU and
> >>> +some devices are assigned to the guest through VFIO. Then the guest
> >>> +OS can use stage-1 (GIOVA -> GPA or GVA->GPA), while the hypervisor
> >>> +uses stage
> >>> +2 for VM isolation (GPA -> HPA).
> >>> +
> >>> +Under dual stage translation, the guest gets ownership of the
> >>> +stage-1 page tables and also owns stage-1 configuration structures.
> >>> +The hypervisor owns the root configuration structure (for security
> >>> +reason), including stage-2 configuration.
> >> This is only true for vtd. On ARM the stage2 cfg is the Context
> >> Descriptor table (aka PASID table). root cfg only store the GPA of
> >> the CD table.
> >
> > I'd like to double-check with you on the meaning of "configuration
> > structures". For VT-d, does it mean the root table/context table/PASID
> > table? If so, then how about the below description?
> Yes I agree

thanks.

> >
> > "Under dual stage translation, the guest gets ownership of the stage-1
> > configuration structures or page tables.
> Actually on ARM the guest both owns the S1 configuration (CD table) and
> S1 page tables ;-)

I see. so on the ARM platform, the guest owns both the configuration and the page tables.

> on Intel I understand the guest only owns the S1 page tables.

yes, on Intel, guest only owns the S1 page tables.

> If confirmed, you may use such kind of explicit statement.

will do.

Regards,
Yi Liu

> Thanks
> 
> Eric
> 
>  This depends on vendor. The
> > hypervisor owns the root configuration structure (for security
> > reason), including stage-2 configuration."
> >
> >>  This works as long as configuration structures and page table
> >>> +formats are compatible between the virtual IOMMU and the physical IOMMU.
> >>> +
> >>> +Assuming the HW supports it, this nested mode is selected by
> >>> +choosing the VFIO_TYPE1_NESTING_IO
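
The dual-stage scheme discussed above composes two translations: the guest-owned stage-1 (GVA/GIOVA -> GPA) feeds the host-owned stage-2 (GPA -> HPA). A toy model of that nested walk, with flat lookup tables standing in for real page tables:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INVALID UINT64_MAX  /* sentinel for a translation fault */

struct toy_map_entry { uint64_t in, out; };
struct toy_map { const struct toy_map_entry *e; size_t n; };

static uint64_t toy_lookup(const struct toy_map *m, uint64_t addr)
{
    for (size_t i = 0; i < m->n; i++)
        if (m->e[i].in == addr)
            return m->e[i].out;
    return TOY_INVALID;
}

/* Nested walk: stage-1 output is the stage-2 input; a miss at either
 * stage faults the whole translation. */
static uint64_t toy_nested_translate(const struct toy_map *s1,
                                     const struct toy_map *s2, uint64_t gva)
{
    uint64_t gpa = toy_lookup(s1, gva);
    if (gpa == TOY_INVALID)
        return TOY_INVALID;
    return toy_lookup(s2, gpa);
}
```
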

RE: [PATCH v6 15/15] iommu/vt-d: Support reporting nesting capability info

2020-08-17 Thread Liu, Yi L
Hi Eric,

> Auger Eric 
> Sent: Sunday, August 16, 2020 8:01 PM
> 
> Hi Yi,
> 
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > This patch reports nesting info, and only supports the case where all
> > the physical iomms have the same CAP/ECAP MASKS.
> s/iomms/iommus

yep.

> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> > v2 -> v3:
> > *) remove cap/ecap_mask in iommu_nesting_info.
> > ---
> >  drivers/iommu/intel/iommu.c | 81
> +++--
> >  include/linux/intel-iommu.h | 16 +
> >  2 files changed, 95 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 88f4647..0835804 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -5660,12 +5660,16 @@ static inline bool iommu_pasid_support(void)
> >  static inline bool nested_mode_support(void)
> >  {
> > struct dmar_drhd_unit *drhd;
> > -   struct intel_iommu *iommu;
> > +   struct intel_iommu *iommu, *prev = NULL;
> > bool ret = true;
> >
> > rcu_read_lock();
> > for_each_active_iommu(iommu, drhd) {
> > -   if (!sm_supported(iommu) || !ecap_nest(iommu->ecap)) {
> > +   if (!prev)
> > +   prev = iommu;
> > +   if (!sm_supported(iommu) || !ecap_nest(iommu->ecap) ||
> > +   (VTD_CAP_MASK & (iommu->cap ^ prev->cap)) ||
> > +   (VTD_ECAP_MASK & (iommu->ecap ^ prev->ecap))) {
> > ret = false;
> > break;
> So this changes the behavior of DOMAIN_ATTR_NESTING. Shouldn't it have a
> Fixes tag as well? And maybe add the capability getter in a separate patch?

yes, this changes the behavior. so would it be better to make it a separate patch
and upstream it on its own? what's your opinion? @Lu, Baolu :-)

> > }
> > @@ -6081,6 +6085,78 @@ intel_iommu_domain_set_attr(struct iommu_domain
> *domain,
> > return ret;
> >  }
> >
> > +static int intel_iommu_get_nesting_info(struct iommu_domain *domain,
> > +   struct iommu_nesting_info *info)
> > +{
> > +   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > +   u64 cap = VTD_CAP_MASK, ecap = VTD_ECAP_MASK;
> > +   struct device_domain_info *domain_info;
> > +   struct iommu_nesting_info_vtd vtd;
> > +   unsigned long flags;
> > +   unsigned int size;
> > +

perhaps better to acquire the lock here. [1]

> > +   if (domain->type != IOMMU_DOMAIN_UNMANAGED ||
> > +   !(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE))
> > +   return -ENODEV;
> > +
> > +   if (!info)
> > +   return -EINVAL;
> > +
> > +   size = sizeof(struct iommu_nesting_info) +
> > +   sizeof(struct iommu_nesting_info_vtd);
> > +   /*
> > +* if provided buffer size is smaller than expected, should
> > +* return 0 and also the expected buffer size to caller.
> > +*/
> > +   if (info->argsz < size) {
> > +   info->argsz = size;
> > +   return 0;
> > +   }
> > +
> > +   spin_lock_irqsave(_domain_lock, flags);
> > +   /*
> > +* arbitrary select the first domain_info as all nesting
> > +* related capabilities should be consistent across iommu
> > +* units.
> > +*/
> > +   domain_info = list_first_entry(_domain->devices,
> > +  struct device_domain_info, link);
> > +   cap &= domain_info->iommu->cap;
> > +   ecap &= domain_info->iommu->ecap;
> > +   spin_unlock_irqrestore(_domain_lock, flags);
> > +
> > +   info->format = IOMMU_PASID_FORMAT_INTEL_VTD;
> > +   info->features = IOMMU_NESTING_FEAT_SYSWIDE_PASID |
> > +IOMMU_NESTING_FEAT_BIND_PGTBL |
> > +IOMMU_NESTING_FEAT_CACHE_INVLD;
> > +   info->addr_width = dmar_domain->gaw;
> > +   info->pasid_bits = ilog2(intel_pasid_max_id);
> > +   info->padding = 0;
> > +   vtd.flags = 0;
> > +   vtd.padding = 0;
> > +   vtd.cap_reg = cap;
> > +   vtd.ecap_reg = ecap;
> > +
> > +   memcpy(info->data, , sizeof(vtd));
> > +   return 0;
> > +}
> > +
> > +static int intel_iommu_doma
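
The intel_iommu_get_nesting_info() hunk above uses the usual argsz handshake: when the caller's buffer is too small, report the needed size and return 0 without filling data, and let the caller retry. A minimal model of that two-call pattern (sizes and field names here are made up):

```c
#include <assert.h>
#include <stdint.h>

struct toy_nesting_info {
    uint32_t argsz;
    uint32_t format;
    /* variable-size vendor data follows in the real uAPI */
};

#define TOY_NEEDED_SIZE 16u  /* hypothetical full size incl. vendor data */
#define TOY_FORMAT_VTD   1u

/* Model of the argsz handshake: a too-small buffer gets the required
 * size reported back; a large-enough buffer gets the data filled in. */
static int toy_get_nesting_info(struct toy_nesting_info *info)
{
    if (info->argsz < TOY_NEEDED_SIZE) {
        info->argsz = TOY_NEEDED_SIZE;
        return 0;
    }
    info->format = TOY_FORMAT_VTD;
    return 0;
}
```
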

RE: [PATCH v6 14/15] vfio: Document dual stage control

2020-08-17 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Sunday, August 16, 2020 7:52 PM
> 
> Hi Yi,
> 
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > From: Eric Auger 
> >
> > The VFIO API was enhanced to support nested stage control: a bunch of new
> > ioctls and a usage guideline.
> >
> > Let's document the process to follow to set up nested mode.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Reviewed-by: Stefan Hajnoczi 
> > Signed-off-by: Eric Auger 
> > Signed-off-by: Liu Yi L 
> > ---
> > v5 -> v6:
> > *) tweak per Eric's comments.
> >
> > v3 -> v4:
> > *) add review-by from Stefan Hajnoczi
> >
> > v2 -> v3:
> > *) address comments from Stefan Hajnoczi
> >
> > v1 -> v2:
> > *) new in v2, compared with Eric's original version, pasid table bind
> >and fault reporting is removed as this series doesn't cover them.
> >Original version from Eric.
> >https://lkml.org/lkml/2020/3/20/700
> > ---
> >  Documentation/driver-api/vfio.rst | 75
> +++
> >  1 file changed, 75 insertions(+)
> >
> > diff --git a/Documentation/driver-api/vfio.rst 
> > b/Documentation/driver-api/vfio.rst
> > index f1a4d3c..c0d43f0 100644
> > --- a/Documentation/driver-api/vfio.rst
> > +++ b/Documentation/driver-api/vfio.rst
> > @@ -239,6 +239,81 @@ group and can access them as follows::
> > /* Gratuitous device reset and go... */
> > ioctl(device, VFIO_DEVICE_RESET);
> >
> > +IOMMU Dual Stage Control
> > +
> > +
> > +Some IOMMUs support 2 stages/levels of translation. Stage corresponds
> > +to the ARM terminology while level corresponds to Intel's terminology.
> > +In the following text we use either without distinction.
> > +
> > +This is useful when the guest is exposed with a virtual IOMMU and some
> > +devices are assigned to the guest through VFIO. Then the guest OS can
> > +use stage-1 (GIOVA -> GPA or GVA->GPA), while the hypervisor uses stage
> > +2 for VM isolation (GPA -> HPA).
> > +
> > +Under dual stage translation, the guest gets ownership of the stage-1 page
> > +tables and also owns stage-1 configuration structures. The hypervisor owns
> > +the root configuration structure (for security reason), including stage-2
> > +configuration.
> This is only true for vtd. On ARM the stage2 cfg is the Context
> Descriptor table (aka PASID table). root cfg only store the GPA of the
> CD table.

I'd like to double-check with you on the meaning of "configuration structures".
For VT-d, does it mean the root table/context table/PASID table? If so,
then how about the below description?

"Under dual stage translation, the guest gets ownership of the stage-1
configuration structures or page tables. This depends on vendor. The
hypervisor owns the root configuration structure (for security reason),
including stage-2 configuration."

>  This works as long as configuration structures and page table
> > +formats are compatible between the virtual IOMMU and the physical IOMMU.
> > +
> > +Assuming the HW supports it, this nested mode is selected by choosing the
> > +VFIO_TYPE1_NESTING_IOMMU type through:
> > +
> > +ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
> > +
> > +This forces the hypervisor to use the stage-2, leaving stage-1 available
> > +for guest usage. The stage-1 format and binding method are vendor specific
> . There are reported in the nesting capability ...

got it.

"The stage-1 format and binding method are reported in nesting capability."

> > +and reported in nesting cap (VFIO_IOMMU_TYPE1_INFO_CAP_NESTING) through
> > +VFIO_IOMMU_GET_INFO:
> > +
> > +ioctl(container->fd, VFIO_IOMMU_GET_INFO, _info);
> > +
> > +The nesting cap info is available only after NESTING_IOMMU is selected.
> > +If underlying IOMMU doesn't support nesting, VFIO_SET_IOMMU fails and
> If the underlying

got it.

> > +userspace should try other IOMMU types. Details of the nesting cap info
> > +can be found in Documentation/userspace-api/iommu.rst.
> > +
> > +The stage-1 page table can be bound to the IOMMU in two methods: directly
> > +or indirectly. Direct binding requires userspace to notify VFIO of every
> Not sure we shall use this direct/indirect terminology. I don't think
> this is part of either ARM or Intel SPEC.
> 
> Suggestion: On Intel, the stage1 page table info are mediated 

RE: [PATCH v6 11/15] vfio/type1: Allow invalidating first-level/stage IOMMU cache

2020-08-17 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Sunday, August 16, 2020 7:35 PM
> 
> Hi Yi,
> 
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > This patch provides an interface allowing the userspace to invalidate
> > IOMMU cache for first-level page table. It is required when the first
> > level IOMMU page table is not managed by the host kernel in the nested
> > translation setup.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Eric Auger 
> > Signed-off-by: Jacob Pan 
> > ---
> > v1 -> v2:
> > *) rename from "vfio/type1: Flush stage-1 IOMMU cache for nesting type"
> > *) rename vfio_cache_inv_fn() to vfio_dev_cache_invalidate_fn()
> > *) vfio_dev_cache_inv_fn() always successful
> > *) remove VFIO_IOMMU_CACHE_INVALIDATE, and reuse
> VFIO_IOMMU_NESTING_OP
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 42
> +
> >  include/uapi/linux/vfio.h   |  3 +++
> >  2 files changed, 45 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index 245436e..bf95a0f 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -3056,6 +3056,45 @@ static long vfio_iommu_handle_pgtbl_op(struct
> vfio_iommu *iommu,
> > return ret;
> >  }
> >
> > +static int vfio_dev_cache_invalidate_fn(struct device *dev, void
> > +*data) {
> > +   struct domain_capsule *dc = (struct domain_capsule *)data;
> > +   unsigned long arg = *(unsigned long *)dc->data;
> > +
> > +   iommu_uapi_cache_invalidate(dc->domain, dev, (void __user *)arg);
> > +   return 0;
> > +}
> > +
> > +static long vfio_iommu_invalidate_cache(struct vfio_iommu *iommu,
> > +   unsigned long arg)
> > +{
> > +   struct domain_capsule dc = { .data =  };
> > +   struct iommu_nesting_info *info;
> > +   int ret;
> > +
> > +   mutex_lock(>lock);
> > +   /*
> > +* Cache invalidation is required for any nesting IOMMU,
> So why do we expose the IOMMU_NESTING_FEAT_CACHE_INVLD capability? :-)

it's a stale comment. should be removed. :-)

> > +* so no need to check system-wide PASID support.
> > +*/
> > +   info = iommu->nesting_info;
> > +   if (!info || !(info->features & IOMMU_NESTING_FEAT_CACHE_INVLD)) {
> > +   ret = -EOPNOTSUPP;
> > +   goto out_unlock;
> > +   }
> > +
> > +   ret = vfio_get_nesting_domain_capsule(iommu, );
> > +   if (ret)
> > +   goto out_unlock;
> > +
> > +   iommu_group_for_each_dev(dc.group->iommu_group, ,
> > +vfio_dev_cache_invalidate_fn);
> > +
> > +out_unlock:
> > +   mutex_unlock(>lock);
> > +   return ret;
> > +}
> > +
> >  static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu,
> > unsigned long arg)
> >  {
> > @@ -3078,6 +3117,9 @@ static long vfio_iommu_type1_nesting_op(struct
> vfio_iommu *iommu,
> > case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL:
> > ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz);
> > break;
> > +   case VFIO_IOMMU_NESTING_OP_CACHE_INVLD:
> > +   ret = vfio_iommu_invalidate_cache(iommu, arg + minsz);
> > +   break;
> > default:
> > ret = -EINVAL;
> > }
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 9501cfb..48e2fb5 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -1225,6 +1225,8 @@ struct vfio_iommu_type1_pasid_request {
> >   * +-+---+
> >   * | UNBIND_PGTBL|  struct iommu_gpasid_bind_data|
> >   *
> > +-+---+
> > + * | CACHE_INVLD |  struct iommu_cache_invalidate_info   |
> > + *
> > + +-+---+
> >   *
> >   * returns: 0 on success, -errno on failure.
> >   */
> > @@ -1237,6 +1239,7 @@ struct vfio_iommu_type1_nesting_op {
> >
> >  #define VFIO_IOMMU_NESTING_OP_BIND_PGTBL   (0)
> >  #define VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL (1)
> > +#define VFIO_IOMMU_NESTING_OP_CACHE_INVLD  (2)
> According to my previous comment, you may refine VFIO_NESTING_OP_MASK too

yes, I've noticed it. also replied in patch 10/15.

Regards,
Yi Liu

> Thanks
> 
> Eric
> >
> >  #define VFIO_IOMMU_NESTING_OP  _IO(VFIO_TYPE, VFIO_BASE + 19)
> >
> >
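
The VFIO_IOMMU_NESTING_OP flow above gates CACHE_INVLD on the reported nesting features and dispatches on the op code, failing unknown ops with -EINVAL. A compact model of that gate-and-dispatch logic (feature bit value is hypothetical):

```c
#include <assert.h>
#include <errno.h>

#define TOY_OP_BIND_PGTBL    0
#define TOY_OP_UNBIND_PGTBL  1
#define TOY_OP_CACHE_INVLD   2
#define TOY_FEAT_CACHE_INVLD 0x4u  /* hypothetical feature bit */

/* Model of vfio_iommu_type1_nesting_op(): CACHE_INVLD is rejected with
 * -EOPNOTSUPP unless the nesting info advertises the feature; unknown
 * op codes fail with -EINVAL. */
static int toy_nesting_op(unsigned int features, int op)
{
    switch (op) {
    case TOY_OP_BIND_PGTBL:
    case TOY_OP_UNBIND_PGTBL:
        return 0; /* handler would run here */
    case TOY_OP_CACHE_INVLD:
        if (!(features & TOY_FEAT_CACHE_INVLD))
            return -EOPNOTSUPP;
        return 0; /* per-device invalidation loop would run here */
    default:
        return -EINVAL;
    }
}
```
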



RE: [PATCH v6 10/15] vfio/type1: Support binding guest page tables to PASID

2020-08-17 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Sunday, August 16, 2020 7:29 PM
> 
> Hi Yi,
> 
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > Nesting translation allows two-levels/stages page tables, with 1st level
> > for guest translations (e.g. GVA->GPA), 2nd level for host translations
> > (e.g. GPA->HPA). This patch adds interface for binding guest page tables
> > to a PASID. This PASID must have been allocated by the userspace before
> > the binding request.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Jean-Philippe Brucker 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> > v5 -> v6:
> > *) dropped vfio_find_nesting_group() and add 
> > vfio_get_nesting_domain_capsule().
> >per comment from Eric.
> > *) use iommu_uapi_sva_bind/unbind_gpasid() and iommu_sva_unbind_gpasid() in
> >linux/iommu.h for userspace operation and in-kernel operation.
> >
> > v3 -> v4:
> > *) address comments from Alex on v3
> >
> > v2 -> v3:
> > *) use __iommu_sva_unbind_gpasid() for unbind call issued by VFIO
> >https://lore.kernel.org/linux-iommu/1592931837-58223-6-git-send-email-
> jacob.jun@linux.intel.com/
> >
> > v1 -> v2:
> > *) rename subject from "vfio/type1: Bind guest page tables to host"
> > *) remove VFIO_IOMMU_BIND, introduce VFIO_IOMMU_NESTING_OP to support
> bind/
> >unbind guet page table
> > *) replaced vfio_iommu_for_each_dev() with a group level loop since this
> >series enforces one group per container w/ nesting type as start.
> > *) rename vfio_bind/unbind_gpasid_fn() to vfio_dev_bind/unbind_gpasid_fn()
> > *) vfio_dev_unbind_gpasid() always successful
> > *) use vfio_mm->pasid_lock to avoid race between PASID free and page table
> >bind/unbind
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 161
> 
> >  drivers/vfio/vfio_pasid.c   |  26 +++
> >  include/linux/vfio.h|  20 +
> >  include/uapi/linux/vfio.h   |  31 
> >  4 files changed, 238 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > b/drivers/vfio/vfio_iommu_type1.c
> > index ea89c7c..245436e 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -149,6 +149,36 @@ struct vfio_regions {
> >  #define DIRTY_BITMAP_PAGES_MAX  ((u64)INT_MAX)
> >  #define DIRTY_BITMAP_SIZE_MAX
> DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
> >
> > +struct domain_capsule {
> > +   struct vfio_group *group;
> > +   struct iommu_domain *domain;
> > +   void *data;
> You may add a bool indicating whether the data is a user pointer or the
> direct IOMMU UAPI struct.

yes, it is helpful.
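For illustration, a standalone sketch of the "user pointer" flag Eric suggests on struct domain_capsule, and how the bind path could branch on it. The stub bind functions stand in for the iommu_uapi_sva_bind_gpasid() (user pointer) and iommu_sva_bind_gpasid() (in-kernel struct) split; everything here is a model with made-up names, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model of a domain_capsule carrying either a __user pointer or an
 * in-kernel struct, distinguished by the suggested flag.
 */
struct domain_capsule_model {
	void *data;  /* user pointer or in-kernel struct */
	int user;    /* nonzero: @data came from userspace */
};

enum bind_path { BIND_UAPI = 1, BIND_KERNEL = 2 };

/* Stand-ins that record which path was taken. */
static enum bind_path bind_from_user(void *udata)   { (void)udata; return BIND_UAPI; }
static enum bind_path bind_from_kernel(void *kdata) { (void)kdata; return BIND_KERNEL; }

/* Dispatch on dc->user, as a vfio_dev_bind_gpasid_fn() might. */
enum bind_path do_bind(struct domain_capsule_model *dc)
{
	return dc->user ? bind_from_user(dc->data) : bind_from_kernel(dc->data);
}
```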

> > +};
> > +
> > +/* iommu->lock must be held */
> > +static int vfio_get_nesting_domain_capsule(struct vfio_iommu *iommu,
> > +  struct domain_capsule *dc)
> I would rename the function into vfio_prepare_nesting_domain_capsule

got it. :-)

> > +{
> > +   struct vfio_domain *domain = NULL;
> > +   struct vfio_group *group = NULL;
> > +
> > +   if (!iommu->nesting_info)
> > +   return -EINVAL;
> > +
> > +   /*
> > +* Only support singleton container with nesting type.
> > +* If nesting_info is non-NULL, the conatiner should
> s/should/is here and below
> s/conatiner/container

got it. thanks.

> > +* be non-empty. Also domain should be non-empty.
> > +*/
> > +   domain = list_first_entry(>domain_list,
> > + struct vfio_domain, next);
> > +   group = list_first_entry(>group_list,
> > +struct vfio_group, next);
> > +   dc->group = group;
> > +   dc->domain = domain->domain;
> dc->user = true;?

yep.

> > +   return 0;
> > +}
> > +
> >  static int put_pfn(unsigned long pfn, int prot);
> >
> >  static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu
> *iommu,
> > @@ -2349,6 +2379,48 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu
> *iommu,
> > return ret;
> >  }
> >
> > +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +   struct domain_capsule *dc = (struct domain_capsule *)data;
> > +   unsigned long arg 

RE: [PATCH v6 09/15] iommu/vt-d: Check ownership for PASIDs from user-space

2020-08-16 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Sunday, August 16, 2020 12:30 AM
> 
> Hi Yi,
> 
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > When an IOMMU domain with nesting attribute is used for guest SVA, a
> > system-wide PASID is allocated for binding with the device and the domain.
> > For security reason, we need to check the PASID passed from user-space.
> > e.g. page table bind/unbind and PASID related cache invalidation.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> >  drivers/iommu/intel/iommu.c | 10 ++
> >  drivers/iommu/intel/svm.c   |  7 +--
> >  2 files changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index b2fe54e..88f4647 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -5436,6 +5436,7 @@ intel_iommu_sva_invalidate(struct iommu_domain
> *domain, struct device *dev,
> > int granu = 0;
> > u64 pasid = 0;
> > u64 addr = 0;
> > +   void *pdata;
> >
> > granu = to_vtd_granularity(cache_type, inv_info->granularity);
> > if (granu == -EINVAL) {
> > @@ -5456,6 +5457,15 @@ intel_iommu_sva_invalidate(struct iommu_domain
> *domain, struct device *dev,
> >  (inv_info->granu.addr_info.flags &
> IOMMU_INV_ADDR_FLAGS_PASID))
> > pasid = inv_info->granu.addr_info.pasid;
> >
> > +   pdata = ioasid_find(dmar_domain->ioasid_sid, pasid, NULL);
> > +   if (!pdata) {
> > +   ret = -EINVAL;
> > +   goto out_unlock;
> > +   } else if (IS_ERR(pdata)) {
> > +   ret = PTR_ERR(pdata);
> > +   goto out_unlock;
> > +   }
> > +
> > switch (BIT(cache_type)) {
> > case IOMMU_CACHE_INV_TYPE_IOTLB:
> > /* HW will ignore LSB bits based on address mask */
> > diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> > index c85b8d5..b9b29ad 100644
> > --- a/drivers/iommu/intel/svm.c
> > +++ b/drivers/iommu/intel/svm.c
> > @@ -323,7 +323,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain,
> struct device *dev,
> > dmar_domain = to_dmar_domain(domain);
> >
> > mutex_lock(_mutex);
> > -   svm = ioasid_find(INVALID_IOASID_SET, data->hpasid, NULL);
> > +   svm = ioasid_find(dmar_domain->ioasid_sid, data->hpasid, NULL);
> A question about the locking strategy. We don't take the
> device_domain_lock here. Could you clarify whether it is safe?

I guess it is better to take the same lock as iommu_domain_set_attr() does. Thanks for catching it. :-)

> 
> > if (IS_ERR(svm)) {
> > ret = PTR_ERR(svm);
> > goto out;
> > @@ -440,6 +440,7 @@ int intel_svm_unbind_gpasid(struct iommu_domain
> *domain,
> > struct device *dev, u32 pasid)
> >  {
> > struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > +   struct dmar_domain *dmar_domain;
> > struct intel_svm_dev *sdev;
> > struct intel_svm *svm;
> > int ret = -EINVAL;
> > @@ -447,8 +448,10 @@ int intel_svm_unbind_gpasid(struct iommu_domain
> *domain,
> > if (WARN_ON(!iommu))
> > return -EINVAL;
> >
> > +   dmar_domain = to_dmar_domain(domain);
> > +
> > mutex_lock(_mutex);
> > -   svm = ioasid_find(INVALID_IOASID_SET, pasid, NULL);
> > +   svm = ioasid_find(dmar_domain->ioasid_sid, pasid, NULL);
> same here.

same.
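As an aside, the ownership check in the hunk above relies on ioasid_find() having three distinct outcomes: NULL (PASID not in the set), an ERR_PTR-encoded error, or valid private data. Below is a standalone model of that pattern; the ERR_PTR/IS_ERR/PTR_ERR helpers mimic the kernel's convention but are simplified assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace re-implementations of the kernel's err.h helpers. */
#define EINVAL 22
#define MAX_ERRNO 4095

void *ERR_PTR(long err) { return (void *)err; }
int IS_ERR(const void *p) { return (unsigned long)p >= (unsigned long)-MAX_ERRNO; }
long PTR_ERR(const void *p) { return (long)p; }

int dummy_pasid_data;  /* stands in for the per-PASID private data */

/* Returns 0 when @pdata is usable, a negative errno otherwise. */
long check_pasid_lookup(void *pdata)
{
	if (!pdata)
		return -EINVAL;        /* PASID not allocated from this set */
	if (IS_ERR(pdata))
		return PTR_ERR(pdata); /* the lookup itself failed */
	return 0;                      /* ownership confirmed */
}
```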

Regards,
Yi Liu

> > if (!svm) {
> > ret = -EINVAL;
> > goto out;
> >
> Thanks
> 
> Eric



RE: [PATCH v6 07/15] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)

2020-08-16 Thread Liu, Yi L
Thanks, Eric.

Regards,
Yi Liu

> From: Auger Eric 
> Sent: Sunday, August 16, 2020 12:30 AM
> 
> Yi,
> 
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > This patch allows userspace to request PASID allocation/free, e.g.
> > when serving the request from the guest.
> >
> > PASIDs that are not freed by userspace are automatically freed when
> > the IOASID set is destroyed when process exits.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Yi Sun 
> > Signed-off-by: Jacob Pan 
> > ---
> > v5 -> v6:
> > *) address comments from Eric against v5. remove the alloc/free helper.
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) the comments for the PASID_FREE request is addressed in patch 5/15 of
> >this series.
> >
> > v3 -> v4:
> > *) address comments from v3, except the below comment against the range
> >of PASID_FREE request. needs more help on it.
> > "> +if (req.range.min > req.range.max)
> >
> >  Is it exploitable that a user can spin the kernel for a long time in
> >  the case of a free by calling this with [0, MAX_UINT] regardless of
> >  their actual allocations?"
> >
> > https://lore.kernel.org/linux-iommu/20200702151832.048b4...@x1.home/
> >
> > v1 -> v2:
> > *) move the vfio_mm related code to be a seprate module
> > *) use a single structure for alloc/free, could support a range of
> > PASIDs
> > *) fetch vfio_mm at group_attach time instead of at iommu driver open
> > time
> > ---
> >  drivers/vfio/Kconfig|  1 +
> >  drivers/vfio/vfio_iommu_type1.c | 69
> +
> >  drivers/vfio/vfio_pasid.c   | 10 ++
> >  include/linux/vfio.h|  6 
> >  include/uapi/linux/vfio.h   | 37 ++
> >  5 files changed, 123 insertions(+)
> >
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index
> > 3d8a108..95d90c6 100644
> > --- a/drivers/vfio/Kconfig
> > +++ b/drivers/vfio/Kconfig
> > @@ -2,6 +2,7 @@
> >  config VFIO_IOMMU_TYPE1
> > tristate
> > depends on VFIO
> > +   select VFIO_PASID if (X86)
> > default n
> >
> >  config VFIO_IOMMU_SPAPR_TCE
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index 18ff0c3..ea89c7c 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -76,6 +76,7 @@ struct vfio_iommu {
> > booldirty_page_tracking;
> > boolpinned_page_dirty_scope;
> > struct iommu_nesting_info   *nesting_info;
> > +   struct vfio_mm  *vmm;
> >  };
> >
> >  struct vfio_domain {
> > @@ -1937,6 +1938,11 @@ static void vfio_iommu_iova_insert_copy(struct
> > vfio_iommu *iommu,
> >
> >  static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
> > {
> > +   if (iommu->vmm) {
> > +   vfio_mm_put(iommu->vmm);
> > +   iommu->vmm = NULL;
> > +   }
> > +
> > kfree(iommu->nesting_info);
> > iommu->nesting_info = NULL;
> >  }
> > @@ -2071,6 +2077,26 @@ static int vfio_iommu_type1_attach_group(void
> *iommu_data,
> > iommu->nesting_info);
> > if (ret)
> > goto out_detach;
> > +
> > +   if (iommu->nesting_info->features &
> > +   IOMMU_NESTING_FEAT_SYSWIDE_PASID)
> {
> > +   struct vfio_mm *vmm;
> > +   int sid;
> > +
> > +   vmm = vfio_mm_get_from_task(current);
> > +   if (IS_ERR(vmm)) {
> > +   ret = PTR_ERR(vmm);
> > +   goto out_detach;
> > +   }
> > +   iommu->vmm = vmm;
> > +
> > +   sid = vfio_mm_ioasid_sid(vmm);
> > +   ret = iommu_domain_set_attr(domain->domain,
> > +   DOMAIN_ATTR_IOASID_SID,
> > +   );
> > +   if (ret)
> > +   goto out_detach;
> > +  

RE: [PATCH v6 06/15] iommu/vt-d: Support setting ioasid set to domain

2020-08-14 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Thursday, August 13, 2020 11:07 PM
> 
> Hi Yi,
> 
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > From IOMMU p.o.v., PASIDs allocated and managed by external components
> > (e.g. VFIO) will be passed in for gpasid_bind/unbind operation. IOMMU
> > needs some knowledge to check the PASID ownership, hence add an
> > interface for those components to tell the PASID owner.
> >
> > In latest kernel design, PASID ownership is managed by IOASID set
> > where the PASID is allocated from. This patch adds support for setting
> > ioasid set ID to the domains used for nesting/vSVA. Subsequent SVA
> > operations will check the PASID against its IOASID set for proper ownership.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> > v5 -> v6:
> > *) address comments against v5 from Eric Auger.
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > ---
> >  drivers/iommu/intel/iommu.c | 23 +++
> > include/linux/intel-iommu.h |  4 
> >  include/linux/iommu.h   |  1 +
> >  3 files changed, 28 insertions(+)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index ed4b71c..b2fe54e 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -1793,6 +1793,7 @@ static struct dmar_domain *alloc_domain(int flags)
> > if (first_level_by_default())
> > domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
> > domain->has_iotlb_device = false;
> > +   domain->ioasid_sid = INVALID_IOASID_SET;
> > INIT_LIST_HEAD(>devices);
> >
> > return domain;
> > @@ -6040,6 +6041,28 @@ intel_iommu_domain_set_attr(struct iommu_domain
> *domain,
> > }
> > spin_unlock_irqrestore(_domain_lock, flags);
> > break;
> > +   case DOMAIN_ATTR_IOASID_SID:
> > +   {
> > +   int sid = *(int *)data;
> 
> > +
> > +   spin_lock_irqsave(_domain_lock, flags);
> > +   if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE)) {
> > +   ret = -ENODEV;
> > +   spin_unlock_irqrestore(_domain_lock, flags);
> > +   break;
> > +   }
> > +   if (dmar_domain->ioasid_sid != INVALID_IOASID_SET &&
> > +   dmar_domain->ioasid_sid != sid) {
> > +   pr_warn_ratelimited("multi ioasid_set (%d:%d) setting",
> > +   dmar_domain->ioasid_sid, sid);
> > +   ret = -EBUSY;
> > +   spin_unlock_irqrestore(_domain_lock, flags);
> > +   break;
> > +   }
> > +   dmar_domain->ioasid_sid = sid;
> > +   spin_unlock_irqrestore(_domain_lock, flags);
> > +   break;
> nit: Adding a small helper
> int__set_ioasid_sid(struct dmar_domain *dmar_domain, int sid_id)
> 
> may simplify the lock handling

ok. will do.
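A standalone sketch of the __set_ioasid_sid() helper Eric suggests, mirroring the checks in the quoted hunk. Types, the flag bit, and errno defines are stubs so this compiles on its own; in the real driver the caller would hold device_domain_lock around the call, which is exactly the simplification the helper buys.

```c
#include <assert.h>

#define ENODEV 19
#define EBUSY  16
#define INVALID_IOASID_SET (-1)
#define DOMAIN_FLAG_NESTING_MODE (1 << 0)  /* stub flag bit */

struct dmar_domain_stub {
	unsigned int flags;
	int ioasid_sid;
};

/* Caller is assumed to hold device_domain_lock. */
int __set_ioasid_sid(struct dmar_domain_stub *dmar_domain, int sid)
{
	if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE))
		return -ENODEV;  /* only nesting domains carry an ioasid set */
	if (dmar_domain->ioasid_sid != INVALID_IOASID_SET &&
	    dmar_domain->ioasid_sid != sid)
		return -EBUSY;   /* already bound to a different set */
	dmar_domain->ioasid_sid = sid;
	return 0;
}
```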

> 
> > +   }
> > default:
> > ret = -EINVAL;
> > break;
> > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> > index 3f23c26..0d0ab32 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -549,6 +549,10 @@ struct dmar_domain {
> >2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
> > u64 max_addr;   /* maximum mapped address */
> >
> > +   int ioasid_sid; /*
> > +* the ioasid set which tracks all
> id of the ioasid set?

It should be the ioasid_set. However, ioasid_alloc_set() returns an sid in Jacob's series. I heard from Jacob that he will remove ioasid_sid and return an ioasid_set instead, so I will modify it once his patch is sent out.

https://lore.kernel.org/linux-iommu/1585158931-1825-4-git-send-email-jacob.jun@linux.intel.com/

> > +* PASIDs used by the domain.
> > +*/
> > int default_pasid;  /*
> >  * The default pasid used for non-SVM
> >  * traffic on mediated devices.
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h index
> > 4a02c9e..b1ff702 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -124,6 +124,7 @@ enum iommu_attr {
> > DOMAIN_ATTR_FSL_PAMUV1,
> > DOMAIN_ATTR_NESTING,/* two stages of translation */
> > DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
> > +   DOMAIN_ATTR_IOASID_SID,
> > DOMAIN_ATTR_MAX,
> >  };
> >
> >
> Besides
> Reviewed-by: Eric Auger 

thanks :-)

Regards,
Yi Liu

> 
> Eric



RE: [PATCH v6 05/15] vfio: Add PASID allocation/free support

2020-08-14 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Thursday, August 13, 2020 11:07 PM
> 
> Yi,
> 
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > Shared Virtual Addressing (a.k.a Shared Virtual Memory) allows sharing
> > multiple process virtual address spaces with the device for simplified
> > programming model. PASID is used to tag an virtual address space in
> > DMA requests and to identify the related translation structure in
> > IOMMU. When a PASID-capable device is assigned to a VM, we want the
> > same capability of using PASID to tag guest process virtual address
> > spaces to achieve virtual SVA (vSVA).
> >
> > PASID management for guest is vendor specific. Some vendors (e.g.
> > Intel
> > VT-d) requires system-wide managed PASIDs across all devices,
> > regardless of whether a device is used by host or assigned to guest.
> > Other vendors (e.g. ARM SMMU) may allow PASIDs managed per-device thus
> > could be fully delegated to the guest for assigned devices.
> >
> > For system-wide managed PASIDs, this patch introduces a vfio module to
> > handle explicit PASID alloc/free requests from guest. Allocated PASIDs
> > are associated to a process (or, mm_struct) in IOASID core. A vfio_mm
> > object is introduced to track mm_struct. Multiple VFIO containers
> > within a process share the same vfio_mm object.
> >
> > A quota mechanism is provided to prevent malicious user from
> > exhausting available PASIDs. Currently the quota is a global parameter
> > applied to all VFIO devices. In the future per-device quota might be 
> > supported too.
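[Editor's sketch] The quota described above can be modeled as a module-wide counter that allocation checks against; once the cap is hit, further requests fail. The quota value, the errno, and the names below are assumptions for illustration; the real accounting lives in vfio_pasid.c.

```c
#include <assert.h>

#define ENOSPC 28

static int pasid_quota = 2;  /* stand-in for the global quota parameter */
static int pasid_count;      /* PASIDs currently handed out */

/* Returns a fake PASID value, or -ENOSPC once the quota is exhausted. */
int model_pasid_alloc(void)
{
	if (pasid_count >= pasid_quota)
		return -ENOSPC;  /* a greedy user hits the cap */
	return 1000 + pasid_count++;
}

void model_pasid_free(void)
{
	if (pasid_count > 0)
		pasid_count--;
}
```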
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Suggested-by: Alex Williamson 
> > Signed-off-by: Liu Yi L 
> > ---
> > v5 -> v6:
> > *) address comments from Eric. Add vfio_unlink_pasid() to be consistent
> >with vfio_unlink_dma(). Add a comment in vfio_pasid_exit().
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) address the comments from Alex on the pasid free range support. Added
> >per vfio_mm pasid r-b tree.
> >https://lore.kernel.org/kvm/20200709082751.32074...@x1.home/
> >
> > v3 -> v4:
> > *) fix lock leam in vfio_mm_get_from_task()
> > *) drop pasid_quota field in struct vfio_mm
> > *) vfio_mm_get_from_task() returns ERR_PTR(-ENOTTY) when
> > !CONFIG_VFIO_PASID
> >
> > v1 -> v2:
> > *) added in v2, split from the pasid alloc/free support of v1
> > ---
> >  drivers/vfio/Kconfig  |   5 +
> >  drivers/vfio/Makefile |   1 +
> >  drivers/vfio/vfio_pasid.c | 248
> ++
> >  include/linux/vfio.h  |  28 ++
> >  4 files changed, 282 insertions(+)
> >  create mode 100644 drivers/vfio/vfio_pasid.c
> >
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index
> > fd17db9..3d8a108 100644
> > --- a/drivers/vfio/Kconfig
> > +++ b/drivers/vfio/Kconfig
> > @@ -19,6 +19,11 @@ config VFIO_VIRQFD
> > depends on VFIO && EVENTFD
> > default n
> >
> > +config VFIO_PASID
> > +   tristate
> > +   depends on IOASID && VFIO
> > +   default n
> > +
> >  menuconfig VFIO
> > tristate "VFIO Non-Privileged userspace driver framework"
> > depends on IOMMU_API
> > diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index
> > de67c47..bb836a3 100644
> > --- a/drivers/vfio/Makefile
> > +++ b/drivers/vfio/Makefile
> > @@ -3,6 +3,7 @@ vfio_virqfd-y := virqfd.o
> >
> >  obj-$(CONFIG_VFIO) += vfio.o
> >  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
> > +obj-$(CONFIG_VFIO_PASID) += vfio_pasid.o
> >  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> >  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> >  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o diff --git
> > a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c new file mode
> > 100644 index 000..befcf29
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_pasid.c
> > @@ -0,0 +1,248 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2020 Intel Corporation.
> > + * Author: Liu Yi L 
> > + *
> > + */
> > +
> > +#include 
> > +#include 
> not needed

oh, yes. 

> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#define DRIVER_VERSION  "0.1"
> > +#define DRIVER_AUTHOR   

RE: [PATCH v6 04/15] vfio/type1: Report iommu nesting info to userspace

2020-08-14 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Thursday, August 13, 2020 9:20 PM
>
> Hi Yi,
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > This patch exports iommu nesting capability info to user space through
> > VFIO. Userspace is expected to check this info for supported uAPIs (e.g.
> > PASID alloc/free, bind page table, and cache invalidation) and the
> > vendor specific format information for first level/stage page table
> > that will be bound to.
> >
> > The nesting info is available only after container set to be NESTED type.
> > Current implementation imposes one limitation - one nesting container
> > should include at most one iommu group. The philosophy of vfio
> > container is having all groups/devices within the container share the
> > same IOMMU context. When vSVA is enabled, one IOMMU context could
> > include one 2nd- level address space and multiple 1st-level address
> > spaces. While the 2nd-level address space is reasonably sharable by
> > multiple groups, blindly sharing 1st-level address spaces across all
> > groups within the container might instead break the guest expectation.
> > In the future sub/super container concept might be introduced to allow
> > partial address space sharing within an IOMMU context. But for now
> > let's go with this restriction by requiring singleton container for
> > using nesting iommu features. Below link has the related discussion about 
> > this
> decision.
> >
> > https://lore.kernel.org/kvm/20200515115924.37e69...@w520.home/
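[Editor's sketch] The singleton-container restriction described above amounts to refusing a second iommu group once a container has negotiated nesting. The field names and the EINVAL choice below are assumptions about the shape of the check, not the actual patch.

```c
#include <assert.h>

#define EINVAL 22

struct model_iommu {
	int has_nesting_info;  /* container negotiated nesting type */
	int group_count;       /* groups already attached */
};

/* Nesting containers stay singleton; others may keep attaching. */
int model_attach_group(struct model_iommu *iommu)
{
	if (iommu->has_nesting_info && iommu->group_count >= 1)
		return -EINVAL;
	iommu->group_count++;
	return 0;
}
```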
> >
> > This patch also changes the NESTING type container behaviour.
> > Something that would have succeeded before will now fail: Before this
> > series, if user asked for a VFIO_IOMMU_TYPE1_NESTING, it would have
> > succeeded even if the SMMU didn't support stage-2, as the driver would
> > have silently fallen back on stage-1 mappings (which work exactly the
> > same as stage-2 only since there was no nesting supported). After the
> > series, we do check for DOMAIN_ATTR_NESTING so if user asks for
> > VFIO_IOMMU_TYPE1_NESTING and the SMMU doesn't support stage-2, the
> > ioctl fails. But it should be a good fix and completely harmless. Detail 
> > can be found
> in below link as well.
> >
> > https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > ---
> > v5 -> v6:
> > *) address comments against v5 from Eric Auger.
> > *) don't report nesting cap to userspace if the nesting_info->format is
> >invalid.
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) return struct iommu_nesting_info for
> VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
> >cap is much "cheap", if needs extension in future, just define another 
> > cap.
> >https://lore.kernel.org/kvm/20200708132947.5b7ee...@x1.home/
> >
> > v3 -> v4:
> > *) address comments against v3.
> >
> > v1 -> v2:
> > *) added in v2
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 106
> +++-
> >  include/uapi/linux/vfio.h   |  19 +++
> >  2 files changed, 113 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index 3bd70ff..18ff0c3 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
> >  "Maximum number of user DMA mappings per container (65535).");
> >
> >  struct vfio_iommu {
> > -   struct list_headdomain_list;
> > -   struct list_headiova_list;
> > -   struct vfio_domain  *external_domain; /* domain for external user */
> > -   struct mutexlock;
> > -   struct rb_root  dma_list;
> > -   struct blocking_notifier_head notifier;
> > -   unsigned intdma_avail;
> > -   uint64_tpgsize_bitmap;
> > -   boolv2;
> > -   boolnesting;
> > -   booldirty_page_tracking;
> > -   boolpinned_page_dirty_scope;
> > +   struct list_headdomain_list;
> > +   struct list_headiova_list;
> > +   /* domain for external user */
> > +   struct vfio_domain  *external_domain;
> > +   

RE: [PATCH v6 02/15] iommu: Report domain nesting info

2020-08-14 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Thursday, August 13, 2020 8:53 PM
> 
> Yi,
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > IOMMUs that support nesting translation needs report the capability info
> s/needs/need to
> > to userspace. It gives information about requirements the userspace needs
> > to implement plus other features characterizing the physical implementation.
> >
> > This patch reports nesting info by DOMAIN_ATTR_NESTING. Caller can get
> > nesting info after setting DOMAIN_ATTR_NESTING. For VFIO, it is after
> > selecting VFIO_TYPE1_NESTING_IOMMU.
> This is not what this patch does ;-) It introduces a new IOMMU UAPI
> struct that gives information about the nesting capabilities and
> features. This struct is supposed to be returned by
> iommu_domain_get_attr() with DOMAIN_ATTR_NESTING attribute parameter,
> one a domain whose type has been set to DOMAIN_ATTR_NESTING.

got it. let me apply your suggestion. thanks. :-)

> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> > v5 -> v6:
> > *) rephrase the feature notes per comments from Eric Auger.
> > *) rename @size of struct iommu_nesting_info to @argsz.
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> >
> > v3 -> v4:
> > *) split the SMMU driver changes to be a separate patch
> > *) move the @addr_width and @pasid_bits from vendor specific
> >part to generic part.
> > *) tweak the description for the @features field of struct
> >iommu_nesting_info.
> > *) add description on the @data[] field of struct iommu_nesting_info
> >
> > v2 -> v3:
> > *) remvoe cap/ecap_mask in iommu_nesting_info.
> > *) reuse DOMAIN_ATTR_NESTING to get nesting info.
> > *) return an empty iommu_nesting_info for SMMU drivers per Jean'
> >suggestion.
> > ---
> >  include/uapi/linux/iommu.h | 74
> ++
> >  1 file changed, 74 insertions(+)
> >
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 7c8e075..5e4745a 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -332,4 +332,78 @@ struct iommu_gpasid_bind_data {
> > } vendor;
> >  };
> >
> > +/*
> > + * struct iommu_nesting_info - Information for nesting-capable IOMMU.
> > + *userspace should check it before using
> > + *nesting capability.
> > + *
> > + * @argsz: size of the whole structure.
> > + * @flags: currently reserved for future extension. must set to 0.
> > + * @format:PASID table entry format, the same definition as struct
> > + * iommu_gpasid_bind_data @format.
> > + * @features:  supported nesting features.
> > + * @addr_width:The output addr width of first level/stage translation
> > + * @pasid_bits:Maximum supported PASID bits, 0 represents no PASID
> > + * support.
> > + * @data:  vendor specific cap info. data[] structure type can be deduced
> > + * from @format field.
> > + *
> > + *
> +===+===
> ===+
> > + * | feature   |  Notes   |
> > + *
> +===+===
> ===+
> > + * | SYSWIDE_PASID |  IOMMU vendor driver sets it to mandate userspace|
> > + * |   |  to allocate PASID from kernel. All PASID allocation |
> > + * |   |  free must be mediated through the TBD API.  |
> s/TBD/IOMMU

got it.

> > + * +---+--+
> > + * | BIND_PGTBL|  IOMMU vendor driver sets it to mandate userspace|
> > + * |   |  bind the first level/stage page table to associated |
> s/bind/to bind

got it.

> > + * |   |  PASID (either the one specified in bind request or  |
> > + * |   |  the default PASID of iommu domain), through IOMMU   |
> > + * |   |  UAPI.   |
> > + * +---+--+
> > + * | CACHE_INVLD   |  IOMMU vendor driver sets it to mandate userspace  

RE: [PATCH v7 6/7] iommu/uapi: Handle data and argsz filled by users

2020-08-13 Thread Liu, Yi L
> From: Auger Eric 
> Sent: Thursday, August 13, 2020 5:31 PM
> 
> Hi Yi,
> 
> On 8/13/20 11:25 AM, Liu, Yi L wrote:
> > Hi Eric,
> >
> >
> >> From: Auger Eric 
> >> Sent: Thursday, August 13, 2020 5:12 PM
> >>
> >> Hi Jacob,
> >>
> >> On 7/30/20 2:21 AM, Jacob Pan wrote:
> >>> IOMMU user APIs are responsible for processing user data. This patch
> >>> changes the interface such that user pointers can be passed into
> >>> IOMMU code directly. Separate kernel APIs without user pointers are
> >>> introduced for in-kernel users of the UAPI functionality.
> >> This is just done for a single function, ie. iommu_sva_unbind_gpasid.
> >>
> >> If I am not wrong there is no user of this latter after applying the
> >> whole series? If correct you may remove it at this stage?
> >
> > the user of this function is in vfio. And it is the same with
> > iommu_uapi_sva_bind/unbind_gpasid() and iommu_uapi_cache_invalidate().
> >
> > https://lore.kernel.org/kvm/1595917664-33276-11-git-send-email-yi.l.li
> > u...@intel.com/
> > https://lore.kernel.org/kvm/1595917664-33276-12-git-send-email-yi.l.li
> > u...@intel.com/
> Yep I know ;-) But this series mostly deals with iommu uapi rework.
> That's not a big deal though.

I see. BTW, it would be great if you could take a look at vfio v6 to see whether your comments are well addressed. :-)

Regards,
Yi Liu

> Thanks
> 
> Eric
> >
> > Regards,
> > Yi Liu
> >
> >>>
> >>> IOMMU UAPI data has a user filled argsz field which indicates the
> >>> data length of the structure. User data is not trusted, argsz must
> >>> be validated based on the current kernel data size, mandatory data
> >>> size, and feature flags.
> >>>
> >>> User data may also be extended, resulting in possible argsz increase.
> >>> Backward compatibility is ensured based on size and flags (or the
> >>> functional equivalent fields) checking.
> >>>
> >>> This patch adds sanity checks in the IOMMU layer. In addition to
> >>> argsz, reserved/unused fields in padding, flags, and version are also 
> >>> checked.
> >>> Details are documented in Documentation/userspace-api/iommu.rst
> >>>
> >>> Signed-off-by: Liu Yi L 
> >>> Signed-off-by: Jacob Pan 
> >>> ---
> >>>  drivers/iommu/iommu.c | 201
> >> --
> >>>  include/linux/iommu.h |  28 ---
> >>>  2 files changed, 212 insertions(+), 17 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index
> >>> 3a913ce94a3d..1ee55c4b3a3a 100644
> >>> --- a/drivers/iommu/iommu.c
> >>> +++ b/drivers/iommu/iommu.c
> >>> @@ -1950,33 +1950,218 @@ int iommu_attach_device(struct iommu_domain
> >> *domain, struct device *dev)
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(iommu_attach_device);
> >>>
> >>> +/*
> >>> + * Check flags and other user provided data for valid combinations.
> >>> +We also
> >>> + * make sure no reserved fields or unused flags are set. This is to
> >>> +ensure
> >>> + * not breaking userspace in the future when these fields or flags are 
> >>> used.
> >>> + */
> >>> +static int iommu_check_cache_invl_data(struct
> >>> +iommu_cache_invalidate_info
> >> *info)
> >>> +{
> >>> + u32 mask;
> >>> + int i;
> >>> +
> >>> + if (info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> >>> + return -EINVAL;
> >>> +
> >>> + mask = (1 << IOMMU_CACHE_INV_TYPE_NR) - 1;
> >>> + if (info->cache & ~mask)
> >>> + return -EINVAL;
> >>> +
> >>> + if (info->granularity >= IOMMU_INV_GRANU_NR)
> >>> + return -EINVAL;
> >>> +
> >>> + switch (info->granularity) {
> >>> + case IOMMU_INV_GRANU_ADDR:
> >>> + if (info->cache & IOMMU_CACHE_INV_TYPE_PASID)
> >>> + return -EINVAL;
> >>> +
> >>> + mask = IOMMU_INV_ADDR_FLAGS_PASID |
> >>> + IOMMU_INV_ADDR_FLAGS_ARCHID |
> >>> + IOMMU_INV_ADDR_FLAGS_LEAF;
> >>> +
> >>> + if (info->granu.addr_info.flags & ~mask)
> >>> 

RE: [PATCH v7 6/7] iommu/uapi: Handle data and argsz filled by users

2020-08-13 Thread Liu, Yi L
Hi Eric,


> From: Auger Eric 
> Sent: Thursday, August 13, 2020 5:12 PM
> 
> Hi Jacob,
> 
> On 7/30/20 2:21 AM, Jacob Pan wrote:
> > IOMMU user APIs are responsible for processing user data. This patch
> > changes the interface such that user pointers can be passed into IOMMU
> > code directly. Separate kernel APIs without user pointers are introduced
> > for in-kernel users of the UAPI functionality.
> This is just done for a single function, ie. iommu_sva_unbind_gpasid.
> 
> If I am not wrong there is no user of this latter after applying the
> whole series? If correct you may remove it at this stage?

the user of this function is in vfio. And it is the same with
iommu_uapi_sva_bind/unbind_gpasid() and iommu_uapi_cache_invalidate().

https://lore.kernel.org/kvm/1595917664-33276-11-git-send-email-yi.l@intel.com/
https://lore.kernel.org/kvm/1595917664-33276-12-git-send-email-yi.l@intel.com/

Regards,
Yi Liu

> >
> > IOMMU UAPI data has a user filled argsz field which indicates the data
> > length of the structure. User data is not trusted, argsz must be
> > validated based on the current kernel data size, mandatory data size,
> > and feature flags.
> >
> > User data may also be extended, resulting in possible argsz increase.
> > Backward compatibility is ensured based on size and flags (or
> > the functional equivalent fields) checking.
> >
> > This patch adds sanity checks in the IOMMU layer. In addition to argsz,
> > reserved/unused fields in padding, flags, and version are also checked.
> > Details are documented in Documentation/userspace-api/iommu.rst
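The argsz handling described in the commit message can be modeled in plain C. This is a simplified sketch, not the kernel code: the struct and helper names (`uapi_info`, `copy_compat`) are illustrative, and the mandatory-size rule is reduced to an `offsetof` check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model of a UAPI struct whose tail may grow over time.
 * The struct name and fields are illustrative, not the kernel's. */
struct uapi_info {
	unsigned int argsz;        /* filled by userspace: size of its view */
	unsigned int flags;
	unsigned long long data;   /* a later-added, optional field */
};

/*
 * Accept a user buffer of user_argsz bytes: reject anything smaller than
 * the mandatory size, and zero-fill fields the (older) user did not
 * provide, so old userspace keeps working against a newer kernel struct.
 */
static int copy_compat(struct uapi_info *dst, const void *ubuf,
		       size_t user_argsz)
{
	size_t minsz = offsetof(struct uapi_info, data); /* mandatory part */

	if (user_argsz < minsz)
		return -1;               /* -EINVAL in the kernel */
	memset(dst, 0, sizeof(*dst));    /* unknown tail defaults to zero */
	memcpy(dst, ubuf,
	       user_argsz < sizeof(*dst) ? user_argsz : sizeof(*dst));
	return 0;
}
```

The same pattern explains why reserved padding must be zero: a field that defaults to zero today can safely gain meaning tomorrow.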
> >
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> >  drivers/iommu/iommu.c | 201
> --
> >  include/linux/iommu.h |  28 ---
> >  2 files changed, 212 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 3a913ce94a3d..1ee55c4b3a3a 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1950,33 +1950,218 @@ int iommu_attach_device(struct iommu_domain
> *domain, struct device *dev)
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_attach_device);
> >
> > +/*
> > + * Check flags and other user provided data for valid combinations. We also
> > + * make sure no reserved fields or unused flags are set. This is to ensure
> > + * not breaking userspace in the future when these fields or flags are 
> > used.
> > + */
> > +static int iommu_check_cache_invl_data(struct iommu_cache_invalidate_info
> *info)
> > +{
> > +   u32 mask;
> > +   int i;
> > +
> > +   if (info->version != IOMMU_CACHE_INVALIDATE_INFO_VERSION_1)
> > +   return -EINVAL;
> > +
> > +   mask = (1 << IOMMU_CACHE_INV_TYPE_NR) - 1;
> > +   if (info->cache & ~mask)
> > +   return -EINVAL;
> > +
> > +   if (info->granularity >= IOMMU_INV_GRANU_NR)
> > +   return -EINVAL;
> > +
> > +   switch (info->granularity) {
> > +   case IOMMU_INV_GRANU_ADDR:
> > +   if (info->cache & IOMMU_CACHE_INV_TYPE_PASID)
> > +   return -EINVAL;
> > +
> > +   mask = IOMMU_INV_ADDR_FLAGS_PASID |
> > +   IOMMU_INV_ADDR_FLAGS_ARCHID |
> > +   IOMMU_INV_ADDR_FLAGS_LEAF;
> > +
> > +   if (info->granu.addr_info.flags & ~mask)
> > +   return -EINVAL;
> > +   break;
> > +   case IOMMU_INV_GRANU_PASID:
> > +   mask = IOMMU_INV_PASID_FLAGS_PASID |
> > +   IOMMU_INV_PASID_FLAGS_ARCHID;
> > +   if (info->granu.pasid_info.flags & ~mask)
> > +   return -EINVAL;
> > +
> > +   break;
> > +   case IOMMU_INV_GRANU_DOMAIN:
> > +   if (info->cache & IOMMU_CACHE_INV_TYPE_DEV_IOTLB)
> > +   return -EINVAL;
> > +   break;
> > +   default:
> > +   return -EINVAL;
> > +   }
> > +
> > +   /* Check reserved padding fields */
> > +   for (i = 0; i < sizeof(info->padding); i++) {
> > +   if (info->padding[i])
> > +   return -EINVAL;
> > +   }
> > +
> > +   return 0;
> > +}
> > +
> >  int iommu_uapi_cache_invalidate(struct iommu_domain *domain, struct device
> *dev,
> > -   struct iommu_cache_invalidate_info *inv_info)
> > +   void __user *uinfo)
> >  {
> > +   st

RE: [PATCH v3 4/4] vfio/type1: Use iommu_aux_at(de)tach_group() APIs

2020-07-30 Thread Liu, Yi L
> From: Lu Baolu 
> Sent: Tuesday, July 14, 2020 1:57 PM
> 
> Replace iommu_aux_at(de)tach_device() with iommu_aux_at(de)tach_group().
> It also saves the IOMMU_DEV_FEAT_AUX-capable physical device in the vfio_group
> data structure so that it can be reused in other places.
> 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 44 ++---
>  1 file changed, 7 insertions(+), 37 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 5e556ac9102a..f8812e68de77 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -100,6 +100,7 @@ struct vfio_dma {
>  struct vfio_group {
>   struct iommu_group  *iommu_group;
>   struct list_headnext;
> + struct device   *iommu_device;

I know an mdev group has only one device, so such a group has a single
iommu_device. But it may be helpful to add a comment here or in the
commit message. Otherwise it looks weird that a group structure
contains a single iommu_device field instead of a list of iommu_devices.

Regards,
Yi Liu

>   boolmdev_group; /* An mdev group */
>   boolpinned_page_dirty_scope;
>  };
> @@ -1627,45 +1628,13 @@ static struct device
> *vfio_mdev_get_iommu_device(struct device *dev)
>   return NULL;
>  }
> 
> -static int vfio_mdev_attach_domain(struct device *dev, void *data)
> -{
> - struct iommu_domain *domain = data;
> - struct device *iommu_device;
> -
> - iommu_device = vfio_mdev_get_iommu_device(dev);
> - if (iommu_device) {
> - if (iommu_dev_feature_enabled(iommu_device,
> IOMMU_DEV_FEAT_AUX))
> - return iommu_aux_attach_device(domain, iommu_device);
> - else
> - return iommu_attach_device(domain, iommu_device);
> - }
> -
> - return -EINVAL;
> -}
> -
> -static int vfio_mdev_detach_domain(struct device *dev, void *data)
> -{
> - struct iommu_domain *domain = data;
> - struct device *iommu_device;
> -
> - iommu_device = vfio_mdev_get_iommu_device(dev);
> - if (iommu_device) {
> - if (iommu_dev_feature_enabled(iommu_device,
> IOMMU_DEV_FEAT_AUX))
> - iommu_aux_detach_device(domain, iommu_device);
> - else
> - iommu_detach_device(domain, iommu_device);
> - }
> -
> - return 0;
> -}
> -
>  static int vfio_iommu_attach_group(struct vfio_domain *domain,
>  struct vfio_group *group)
>  {
>   if (group->mdev_group)
> - return iommu_group_for_each_dev(group->iommu_group,
> - domain->domain,
> - vfio_mdev_attach_domain);
> + return iommu_aux_attach_group(domain->domain,
> +   group->iommu_group,
> +   group->iommu_device);
>   else
> - return iommu_attach_group(domain->domain, group->iommu_group);
> }
> @@ -1674,8 +1643,8 @@ static void vfio_iommu_detach_group(struct vfio_domain *domain,
>   struct vfio_group *group)
>  {
>   if (group->mdev_group)
> - iommu_group_for_each_dev(group->iommu_group, domain-
> >domain,
> -  vfio_mdev_detach_domain);
> + iommu_aux_detach_group(domain->domain, group->iommu_group,
> +group->iommu_device);
>   else
> 	 iommu_detach_group(domain->domain, group->iommu_group);
> }
> @@ -2007,6 +1976,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>   return 0;
>   }
> 
> + group->iommu_device = iommu_device;
>   bus = iommu_device->bus;
>   }
> 
> --
> 2.17.1



RE: [PATCH v6 01/15] vfio/type1: Refactor vfio_iommu_type1_ioctl()

2020-07-28 Thread Liu, Yi L
> From: Alex Williamson 
> Sent: Tuesday, July 28, 2020 11:54 PM
> 
> On Mon, 27 Jul 2020 23:27:30 -0700
> Liu Yi L  wrote:
> 
> > This patch refactors the vfio_iommu_type1_ioctl() to use switch
> > instead of if-else, and each command got a helper function.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Reviewed-by: Eric Auger 
> > Suggested-by: Christoph Hellwig 
> > Signed-off-by: Liu Yi L 
> > ---
> 
> FYI, this commit is already in my next branch and linux-next as of today,
> you can drop it from future series.  Thanks,

got it. thanks. :-)

Regards,
Yi Liu

> Alex
> 
> > v4 -> v5:
> > *) address comments from Eric Auger, add r-b from Eric.
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 394
> > ++--
> >  1 file changed, 213 insertions(+), 181 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index 5e556ac..3bd70ff 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2453,6 +2453,23 @@ static int vfio_domains_have_iommu_cache(struct
> vfio_iommu *iommu)
> > return ret;
> >  }
> >
> > +static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
> > +   unsigned long arg)
> > +{
> > +   switch (arg) {
> > +   case VFIO_TYPE1_IOMMU:
> > +   case VFIO_TYPE1v2_IOMMU:
> > +   case VFIO_TYPE1_NESTING_IOMMU:
> > +   return 1;
> > +   case VFIO_DMA_CC_IOMMU:
> > +   if (!iommu)
> > +   return 0;
> > +   return vfio_domains_have_iommu_cache(iommu);
> > +   default:
> > +   return 0;
> > +   }
> > +}
> > +
> >  static int vfio_iommu_iova_add_cap(struct vfio_info_cap *caps,
> >  struct vfio_iommu_type1_info_cap_iova_range *cap_iovas,
> >  size_t size)
> > @@ -2529,241 +2546,256 @@ static int vfio_iommu_migration_build_caps(struct
> vfio_iommu *iommu,
> > return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig));
> > }
> >
> > -static long vfio_iommu_type1_ioctl(void *iommu_data,
> > -  unsigned int cmd, unsigned long arg)
> > +static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
> > +unsigned long arg)
> >  {
> > -   struct vfio_iommu *iommu = iommu_data;
> > +   struct vfio_iommu_type1_info info;
> > unsigned long minsz;
> > +   struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
> > +   unsigned long capsz;
> > +   int ret;
> >
> > -   if (cmd == VFIO_CHECK_EXTENSION) {
> > -   switch (arg) {
> > -   case VFIO_TYPE1_IOMMU:
> > -   case VFIO_TYPE1v2_IOMMU:
> > -   case VFIO_TYPE1_NESTING_IOMMU:
> > -   return 1;
> > -   case VFIO_DMA_CC_IOMMU:
> > -   if (!iommu)
> > -   return 0;
> > -   return vfio_domains_have_iommu_cache(iommu);
> > -   default:
> > -   return 0;
> > -   }
> > -   } else if (cmd == VFIO_IOMMU_GET_INFO) {
> > -   struct vfio_iommu_type1_info info;
> > -   struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
> > -   unsigned long capsz;
> > -   int ret;
> > -
> > -   minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
> > +   minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
> >
> > -   /* For backward compatibility, cannot require this */
> > -   capsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);
> > +   /* For backward compatibility, cannot require this */
> > +   capsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);
> >
> > -   if (copy_from_user(&info, (void __user *)arg, minsz))
> > -   return -EFAULT;
> > +   if (copy_from_user(&info, (void __user *)arg, minsz))
> > +   return -EFAULT;
> >
> > -   if (info.argsz < minsz)
> > -   return -EINVAL;
> > +   if (info.argsz < minsz)
> > +   return -EINVAL;
> >
> > -   if (info.argsz >= capsz) {
> > -   minsz = capsz;
> > -   info.cap_offset = 0; /* output

[PATCH v6 04/15] vfio/type1: Report iommu nesting info to userspace

2020-07-28 Thread Liu Yi L
This patch exports iommu nesting capability info to user space through
VFIO. Userspace is expected to check this info for supported uAPIs (e.g.
PASID alloc/free, bind page table, and cache invalidation) and for the
vendor-specific format information of the first level/stage page table
that will be bound.

The nesting info is available only after the container is set to the
NESTED type. The current implementation imposes one limitation: one
nesting container
should include at most one iommu group. The philosophy of vfio container
is having all groups/devices within the container share the same IOMMU
context. When vSVA is enabled, one IOMMU context could include one 2nd-
level address space and multiple 1st-level address spaces. While the
2nd-level address space is reasonably sharable by multiple groups, blindly
sharing 1st-level address spaces across all groups within the container
might instead break the guest expectation. In the future sub/super container
concept might be introduced to allow partial address space sharing within
an IOMMU context. But for now let's go with this restriction by requiring
singleton container for using nesting iommu features. Below link has the
related discussion about this decision.

https://lore.kernel.org/kvm/20200515115924.37e69...@w520.home/
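The singleton-container restriction discussed above amounts to a simple check at group-attach time. A minimal sketch, with hypothetical names (`container`, `attach_group`) standing in for the vfio_iommu state and attach path:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy container state; field names are illustrative, not vfio_iommu's. */
struct container {
	bool nesting;      /* VFIO_TYPE1_NESTING_IOMMU selected */
	int  group_count;  /* iommu groups currently attached */
};

/* Refuse a second group on a nesting container, per the singleton rule. */
static int attach_group(struct container *c)
{
	if (c->nesting && c->group_count >= 1)
		return -1;   /* the kernel would fail the attach here */
	c->group_count++;
	return 0;
}
```

The point of the rule: stage-1 address spaces are per-process, so they cannot be blindly shared by every group a container might hold.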

This patch also changes the NESTING type container behaviour. Something
that would have succeeded before will now fail: before this series, if a
user asked for VFIO_IOMMU_TYPE1_NESTING, it would have succeeded even
if the SMMU didn't support stage-2, as the driver would have silently
fallen back on stage-1 mappings (which work exactly the same as stage-2
only, since there was no nesting support). After this series, we do check
for DOMAIN_ATTR_NESTING, so if a user asks for VFIO_IOMMU_TYPE1_NESTING
and the SMMU doesn't support stage-2, the ioctl fails. This should be a
good fix and completely harmless. Details can be found in the link below
as well.

https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
---
v5 -> v6:
*) address comments against v5 from Eric Auger.
*) don't report nesting cap to userspace if the nesting_info->format is
   invalid.

v4 -> v5:
*) address comments from Eric Auger.
*) return struct iommu_nesting_info for VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
   cap is much "cheap", if needs extension in future, just define another cap.
   https://lore.kernel.org/kvm/20200708132947.5b7ee...@x1.home/

v3 -> v4:
*) address comments against v3.

v1 -> v2:
*) added in v2
---
 drivers/vfio/vfio_iommu_type1.c | 106 +++-
 include/uapi/linux/vfio.h   |  19 +++
 2 files changed, 113 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3bd70ff..18ff0c3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
 "Maximum number of user DMA mappings per container (65535).");
 
 struct vfio_iommu {
-   struct list_headdomain_list;
-   struct list_headiova_list;
-   struct vfio_domain  *external_domain; /* domain for external user */
-   struct mutexlock;
-   struct rb_root  dma_list;
-   struct blocking_notifier_head notifier;
-   unsigned intdma_avail;
-   uint64_tpgsize_bitmap;
-   boolv2;
-   boolnesting;
-   booldirty_page_tracking;
-   boolpinned_page_dirty_scope;
+   struct list_headdomain_list;
+   struct list_headiova_list;
+   /* domain for external user */
+   struct vfio_domain  *external_domain;
+   struct mutexlock;
+   struct rb_root  dma_list;
+   struct blocking_notifier_head   notifier;
+   unsigned intdma_avail;
+   uint64_tpgsize_bitmap;
+   boolv2;
+   boolnesting;
+   booldirty_page_tracking;
+   boolpinned_page_dirty_scope;
+   struct iommu_nesting_info   *nesting_info;
 };
 
 struct vfio_domain {
@@ -130,6 +132,9 @@ struct vfio_regions {
 #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)\
(!list_empty(&iommu->domain_list))
 
+#define CONTAINER_HAS_DOMAIN(iommu)(((iommu)->external_domain) || \
+(!list_empty(&(iommu)->domain_list)))
+
 #define DIRTY_BITMAP_BYTES(n)  (ALIGN(n, BITS_PER_TYPE(u64)) / BITS_PER_BYTE)
 
 /*
@@ -1929,6 +1934,13 @@ static void vfio_

[PATCH v6 06/15] iommu/vt-d: Support setting ioasid set to domain

2020-07-28 Thread Liu Yi L
From the IOMMU's point of view, PASIDs allocated and managed by external
components (e.g. VFIO) will be passed in for gpasid_bind/unbind operations.
The IOMMU needs some knowledge to check PASID ownership, hence add an
interface for those components to tell the IOMMU the PASID owner.

In the latest kernel design, PASID ownership is managed by the IOASID set
from which the PASID is allocated. This patch adds support for setting an
ioasid set ID on the domains used for nesting/vSVA. Subsequent SVA
operations will check the PASID against its IOASID set for proper ownership.
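The ownership check this enables boils down to: look the PASID up in the set bound to the domain and reject it if it belongs to another set. A toy model (the array registry is illustrative; the real tracking lives in the kernel's IOASID core):

```c
#include <assert.h>

#define INVALID_IOASID_SET (-1)
#define MAX_PASID 16

/* Toy registry: which ioasid set each PASID came from (0 = unallocated). */
static int pasid_owner[MAX_PASID];

/*
 * Mirror of the check the patch enables: a user-supplied PASID is only
 * accepted if it was allocated from the set bound to this domain, much
 * as ioasid_find(domain->ioasid_sid, pasid, ...) does in the series.
 */
static int check_ownership(int domain_sid, int pasid)
{
	if (domain_sid == INVALID_IOASID_SET)
		return -1;                /* domain not set up for vSVA */
	if (pasid < 0 || pasid >= MAX_PASID)
		return -1;
	return pasid_owner[pasid] == domain_sid ? 0 : -1;
}
```

This is why the domain grows an ioasid_sid field: without it, a guest could invalidate or bind against a PASID it never owned.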

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
v5 -> v6:
*) address comments against v5 from Eric Auger.

v4 -> v5:
*) address comments from Eric Auger.
---
 drivers/iommu/intel/iommu.c | 23 +++
 include/linux/intel-iommu.h |  4 
 include/linux/iommu.h   |  1 +
 3 files changed, 28 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ed4b71c..b2fe54e 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1793,6 +1793,7 @@ static struct dmar_domain *alloc_domain(int flags)
if (first_level_by_default())
domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
domain->has_iotlb_device = false;
+   domain->ioasid_sid = INVALID_IOASID_SET;
INIT_LIST_HEAD(&domain->devices);
 
return domain;
@@ -6040,6 +6041,28 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
}
spin_unlock_irqrestore(&device_domain_lock, flags);
break;
+   case DOMAIN_ATTR_IOASID_SID:
+   {
+   int sid = *(int *)data;
+
+   spin_lock_irqsave(&device_domain_lock, flags);
+   if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE)) {
+   ret = -ENODEV;
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+   break;
+   }
+   if (dmar_domain->ioasid_sid != INVALID_IOASID_SET &&
+   dmar_domain->ioasid_sid != sid) {
+   pr_warn_ratelimited("multi ioasid_set (%d:%d) setting",
+   dmar_domain->ioasid_sid, sid);
+   ret = -EBUSY;
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+   break;
+   }
+   dmar_domain->ioasid_sid = sid;
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+   break;
+   }
default:
ret = -EINVAL;
break;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 3f23c26..0d0ab32 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -549,6 +549,10 @@ struct dmar_domain {
   2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
u64 max_addr;   /* maximum mapped address */
 
+   int ioasid_sid; /*
+* the ioasid set which tracks all
+* PASIDs used by the domain.
+*/
int default_pasid;  /*
 * The default pasid used for non-SVM
 * traffic on mediated devices.
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 4a02c9e..b1ff702 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -124,6 +124,7 @@ enum iommu_attr {
DOMAIN_ATTR_FSL_PAMUV1,
DOMAIN_ATTR_NESTING,/* two stages of translation */
DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
+   DOMAIN_ATTR_IOASID_SID,
DOMAIN_ATTR_MAX,
 };
 
-- 
2.7.4



[PATCH v6 05/15] vfio: Add PASID allocation/free support

2020-07-28 Thread Liu Yi L
Shared Virtual Addressing (a.k.a. Shared Virtual Memory) allows sharing
multiple process virtual address spaces with the device for a simplified
programming model. A PASID is used to tag a virtual address space in DMA
requests and to identify the related translation structure in the IOMMU.
When a PASID-capable device is assigned to a VM, we want the same
capability of using PASIDs to tag guest process virtual address spaces
to achieve virtual SVA (vSVA).

PASID management for guests is vendor specific. Some vendors (e.g. Intel
VT-d) require system-wide managed PASIDs across all devices, regardless
of whether a device is used by the host or assigned to a guest. Other
vendors (e.g. ARM SMMU) may allow PASIDs to be managed per-device, so
management could be fully delegated to the guest for assigned devices.

For system-wide managed PASIDs, this patch introduces a vfio module to
handle explicit PASID alloc/free requests from guest. Allocated PASIDs
are associated to a process (or, mm_struct) in IOASID core. A vfio_mm
object is introduced to track mm_struct. Multiple VFIO containers within
a process share the same vfio_mm object.

A quota mechanism is provided to prevent a malicious user from exhausting
the available PASIDs. Currently the quota is a global parameter applied to
all VFIO devices. In the future, per-device quotas might be supported too.
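The quota accounting described above can be sketched as a counter checked on every allocation. This is a toy model of the behaviour behind the pasid_quota module parameter; the struct and function names are illustrative, and the PASID numbering is arbitrary:

```c
#include <assert.h>

/* Toy model of per-process PASID accounting behind pasid_quota. */
struct vfio_mm_model {
	int allocated;   /* PASIDs currently held by this process */
	int quota;       /* max the process may hold */
};

/* Allocate a PASID, failing once the quota is reached (-ENOSPC-like). */
static int pasid_alloc(struct vfio_mm_model *vmm)
{
	if (vmm->allocated >= vmm->quota)
		return -1;
	return 1000 + vmm->allocated++;  /* arbitrary PASID numbering */
}
```

Keying the quota to the process (mm_struct) rather than the container is what makes multiple containers in one process share a single budget.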

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Suggested-by: Alex Williamson 
Signed-off-by: Liu Yi L 
---
v5 -> v6:
*) address comments from Eric. Add vfio_unlink_pasid() to be consistent
   with vfio_unlink_dma(). Add a comment in vfio_pasid_exit().

v4 -> v5:
*) address comments from Eric Auger.
*) address the comments from Alex on the pasid free range support. Added
   per vfio_mm pasid r-b tree.
   https://lore.kernel.org/kvm/20200709082751.32074...@x1.home/

v3 -> v4:
*) fix lock leak in vfio_mm_get_from_task()
*) drop pasid_quota field in struct vfio_mm
*) vfio_mm_get_from_task() returns ERR_PTR(-ENOTTY) when !CONFIG_VFIO_PASID

v1 -> v2:
*) added in v2, split from the pasid alloc/free support of v1
---
 drivers/vfio/Kconfig  |   5 +
 drivers/vfio/Makefile |   1 +
 drivers/vfio/vfio_pasid.c | 248 ++
 include/linux/vfio.h  |  28 ++
 4 files changed, 282 insertions(+)
 create mode 100644 drivers/vfio/vfio_pasid.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index fd17db9..3d8a108 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -19,6 +19,11 @@ config VFIO_VIRQFD
depends on VFIO && EVENTFD
default n
 
+config VFIO_PASID
+   tristate
+   depends on IOASID && VFIO
+   default n
+
 menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index de67c47..bb836a3 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -3,6 +3,7 @@ vfio_virqfd-y := virqfd.o
 
 obj-$(CONFIG_VFIO) += vfio.o
 obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
+obj-$(CONFIG_VFIO_PASID) += vfio_pasid.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
 obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c
new file mode 100644
index 000..befcf29
--- /dev/null
+++ b/drivers/vfio/vfio_pasid.c
@@ -0,0 +1,248 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Intel Corporation.
+ * Author: Liu Yi L 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Liu Yi L "
+#define DRIVER_DESC "PASID management for VFIO bus drivers"
+
+#define VFIO_DEFAULT_PASID_QUOTA   1000
+static int pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
+module_param_named(pasid_quota, pasid_quota, uint, 0444);
+MODULE_PARM_DESC(pasid_quota,
"Set the quota for max number of PASIDs that an application is allowed to request (default 1000)");
+
+struct vfio_mm_token {
+   unsigned long long val;
+};
+
+struct vfio_mm {
+   struct kref kref;
+   int ioasid_sid;
+   struct mutexpasid_lock;
+   struct rb_root  pasid_list;
+   struct list_headnext;
+   struct vfio_mm_tokentoken;
+};
+
+static struct mutexvfio_mm_lock;
+static struct list_headvfio_mm_list;
+
+struct vfio_pasid {
+   struct rb_node  node;
+   ioasid_tpasid;
+};
+
+static void vfio_remove_all_pasids(struct vfio_mm *vmm);
+
+/* called with vfio.vfio_mm_lock held */
+static void vfio_mm_release(struct kref *kref)
+{
+   struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
+
+   list_del(&vmm->next);
+   mutex_unlock(&vfio_mm_lo

[PATCH v6 13/15] vfio/pci: Expose PCIe PASID capability to guest

2020-07-28 Thread Liu Yi L
This patch exposes PCIe PASID capability to guest for assigned devices.
Existing vfio_pci driver hides it from guest by setting the capability
length as 0 in pci_ext_cap_length[].

This patch only exposes the PASID capability for devices which have the
PCIe PASID extended structure in their configuration space. VFs will not
expose the PASID capability as they do not implement the PASID extended
structure in their config space. This is a TODO for the future. Related
discussion can be found at the link below:

https://lore.kernel.org/kvm/20200407095801.648b1...@w520.home/

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Reviewed-by: Eric Auger 
Signed-off-by: Liu Yi L 
---
v5 -> v6:
*) add review-by from Eric Auger.

v1 -> v2:
*) added in v2, but it was sent in a separate patchseries before
---
 drivers/vfio/pci/vfio_pci_config.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index d98843f..07ff2e6 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -95,7 +95,7 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = 
{
[PCI_EXT_CAP_ID_LTR]=   PCI_EXT_CAP_LTR_SIZEOF,
[PCI_EXT_CAP_ID_SECPCI] =   0,  /* not yet */
[PCI_EXT_CAP_ID_PMUX]   =   0,  /* not yet */
-   [PCI_EXT_CAP_ID_PASID]  =   0,  /* not yet */
+   [PCI_EXT_CAP_ID_PASID]  =   PCI_EXT_CAP_PASID_SIZEOF,
 };
 
 /*
-- 
2.7.4



[PATCH v6 09/15] iommu/vt-d: Check ownership for PASIDs from user-space

2020-07-28 Thread Liu Yi L
When an IOMMU domain with the nesting attribute is used for guest SVA, a
system-wide PASID is allocated for binding with the device and the domain.
For security reasons, we need to check the PASID passed from user-space,
e.g. for page table bind/unbind and PASID-related cache invalidation.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
 drivers/iommu/intel/iommu.c | 10 ++
 drivers/iommu/intel/svm.c   |  7 +--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index b2fe54e..88f4647 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5436,6 +5436,7 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, 
struct device *dev,
int granu = 0;
u64 pasid = 0;
u64 addr = 0;
+   void *pdata;
 
granu = to_vtd_granularity(cache_type, inv_info->granularity);
if (granu == -EINVAL) {
@@ -5456,6 +5457,15 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, 
struct device *dev,
 (inv_info->granu.addr_info.flags & 
IOMMU_INV_ADDR_FLAGS_PASID))
pasid = inv_info->granu.addr_info.pasid;
 
+   pdata = ioasid_find(dmar_domain->ioasid_sid, pasid, NULL);
+   if (!pdata) {
+   ret = -EINVAL;
+   goto out_unlock;
+   } else if (IS_ERR(pdata)) {
+   ret = PTR_ERR(pdata);
+   goto out_unlock;
+   }
+
switch (BIT(cache_type)) {
case IOMMU_CACHE_INV_TYPE_IOTLB:
/* HW will ignore LSB bits based on address mask */
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index c85b8d5..b9b29ad 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -323,7 +323,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, 
struct device *dev,
dmar_domain = to_dmar_domain(domain);
 
mutex_lock(&pasid_mutex);
-   svm = ioasid_find(INVALID_IOASID_SET, data->hpasid, NULL);
+   svm = ioasid_find(dmar_domain->ioasid_sid, data->hpasid, NULL);
if (IS_ERR(svm)) {
ret = PTR_ERR(svm);
goto out;
@@ -440,6 +440,7 @@ int intel_svm_unbind_gpasid(struct iommu_domain *domain,
struct device *dev, u32 pasid)
 {
struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
+   struct dmar_domain *dmar_domain;
struct intel_svm_dev *sdev;
struct intel_svm *svm;
int ret = -EINVAL;
@@ -447,8 +448,10 @@ int intel_svm_unbind_gpasid(struct iommu_domain *domain,
if (WARN_ON(!iommu))
return -EINVAL;
 
+   dmar_domain = to_dmar_domain(domain);
+
mutex_lock(&pasid_mutex);
-   svm = ioasid_find(INVALID_IOASID_SET, pasid, NULL);
+   svm = ioasid_find(dmar_domain->ioasid_sid, pasid, NULL);
if (!svm) {
ret = -EINVAL;
goto out;
-- 
2.7.4



[PATCH v6 03/15] iommu/smmu: Report empty domain nesting info

2020-07-28 Thread Liu Yi L
Instead of returning a boolean for DOMAIN_ATTR_NESTING,
iommu_domain_get_attr() should return an iommu_nesting_info handle. For
now, return an empty nesting info struct, as true nesting is not yet
supported by the SMMUs.

Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Reviewed-by: Eric Auger 
Suggested-by: Jean-Philippe Brucker 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
v5 -> v6:
*) add review-by from Eric Auger.

v4 -> v5:
*) address comments from Eric Auger.
---
 drivers/iommu/arm-smmu-v3.c | 29 +++--
 drivers/iommu/arm-smmu.c| 29 +++--
 2 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677..c702ef9 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -3019,6 +3019,32 @@ static struct iommu_group *arm_smmu_device_group(struct 
device *dev)
return group;
 }
 
+static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
+   void *data)
+{
+   struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
+   unsigned int size;
+
+   if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   return -ENODEV;
+
+   size = sizeof(struct iommu_nesting_info);
+
+   /*
+* if provided buffer size is smaller than expected, should
+* return 0 and also the expected buffer size to caller.
+*/
+   if (info->argsz < size) {
+   info->argsz = size;
+   return 0;
+   }
+
+   /* report an empty iommu_nesting_info for now */
+   memset(info, 0x0, size);
+   info->argsz = size;
+   return 0;
+}
+
 static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
enum iommu_attr attr, void *data)
 {
@@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain 
*domain,
case IOMMU_DOMAIN_UNMANAGED:
switch (attr) {
case DOMAIN_ATTR_NESTING:
-   *(int *)data = (smmu_domain->stage == 
ARM_SMMU_DOMAIN_NESTED);
-   return 0;
+   return arm_smmu_domain_nesting_info(smmu_domain, data);
default:
return -ENODEV;
}
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 243bc4c..2bd58f4 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1506,6 +1506,32 @@ static struct iommu_group *arm_smmu_device_group(struct 
device *dev)
return group;
 }
 
+static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
+   void *data)
+{
+   struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
+   unsigned int size;
+
+   if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   return -ENODEV;
+
+   size = sizeof(struct iommu_nesting_info);
+
+   /*
+* if provided buffer size is smaller than expected, should
+* return 0 and also the expected buffer size to caller.
+*/
+   if (info->argsz < size) {
+   info->argsz = size;
+   return 0;
+   }
+
+   /* report an empty iommu_nesting_info for now */
+   memset(info, 0x0, size);
+   info->argsz = size;
+   return 0;
+}
+
 static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
enum iommu_attr attr, void *data)
 {
@@ -1515,8 +1541,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain 
*domain,
case IOMMU_DOMAIN_UNMANAGED:
switch (attr) {
case DOMAIN_ATTR_NESTING:
-   *(int *)data = (smmu_domain->stage == 
ARM_SMMU_DOMAIN_NESTED);
-   return 0;
+   return arm_smmu_domain_nesting_info(smmu_domain, data);
default:
return -ENODEV;
}
-- 
2.7.4



[PATCH v6 14/15] vfio: Document dual stage control

2020-07-28 Thread Liu Yi L
From: Eric Auger 

The VFIO API was enhanced to support nested stage control: a bunch of
new ioctls and usage guideline.

Let's document the process to follow to set up nested mode.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Eric Auger 
Signed-off-by: Liu Yi L 
---
v5 -> v6:
*) tweak per Eric's comments.

v3 -> v4:
*) add review-by from Stefan Hajnoczi

v2 -> v3:
*) address comments from Stefan Hajnoczi

v1 -> v2:
*) new in v2, compared with Eric's original version, pasid table bind
   and fault reporting is removed as this series doesn't cover them.
   Original version from Eric.
   https://lkml.org/lkml/2020/3/20/700
---
 Documentation/driver-api/vfio.rst | 75 +++
 1 file changed, 75 insertions(+)

diff --git a/Documentation/driver-api/vfio.rst 
b/Documentation/driver-api/vfio.rst
index f1a4d3c..c0d43f0 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -239,6 +239,81 @@ group and can access them as follows::
/* Gratuitous device reset and go... */
ioctl(device, VFIO_DEVICE_RESET);
 
+IOMMU Dual Stage Control
+
+
+Some IOMMUs support 2 stages/levels of translation. Stage corresponds
+to the ARM terminology while level corresponds to Intel's terminology.
+In the following text we use either without distinction.
+
+This is useful when the guest is exposed to a virtual IOMMU and some
+devices are assigned to the guest through VFIO. Then the guest OS can
+use stage 1 (GIOVA -> GPA or GVA -> GPA), while the hypervisor uses
+stage 2 for VM isolation (GPA -> HPA).
+
+Under dual stage translation, the guest gets ownership of the stage-1 page
+tables and also owns the stage-1 configuration structures. The hypervisor
+owns the root configuration structure (for security reasons), including the
+stage-2 configuration. This works as long as the configuration structures
+and page table formats are compatible between the virtual IOMMU and the
+physical IOMMU.
+
+Assuming the HW supports it, this nested mode is selected by choosing the
+VFIO_TYPE1_NESTING_IOMMU type through:
+
+ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
+
+This forces the hypervisor to use stage 2, leaving stage 1 available
+for guest usage. The stage-1 format and binding method are vendor specific
+and reported in the nesting cap (VFIO_IOMMU_TYPE1_INFO_CAP_NESTING) through
+VFIO_IOMMU_GET_INFO:
+
+ioctl(container->fd, VFIO_IOMMU_GET_INFO, &nesting_info);
+
+The nesting cap info is available only after NESTING_IOMMU is selected.
+If the underlying IOMMU doesn't support nesting, VFIO_SET_IOMMU fails and
+userspace should try other IOMMU types. Details of the nesting cap info
+can be found in Documentation/userspace-api/iommu.rst.
+
+The stage-1 page table can be bound to the IOMMU in one of two ways:
+directly or indirectly. Direct binding requires userspace to notify VFIO
+of every guest stage-1 page table binding, while indirect binding allows
+userspace to bind once with an intermediate structure (e.g. a PASID table)
+which indirectly links to the guest stage-1 page tables. The actual binding
+method depends on the IOMMU vendor. Currently only the direct binding
+capability (IOMMU_NESTING_FEAT_BIND_PGTBL) is supported:
+
+nesting_op->flags = VFIO_IOMMU_NESTING_OP_BIND_PGTBL;
+memcpy(&nesting_op->data, &bind_data, sizeof(bind_data));
+ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
+
+When multiple stage-1 page tables are supported on a device, each page
+table is associated with a PASID (Process Address Space ID) to
+differentiate it from the others. In that case, userspace should include
+the PASID in the bind_data when issuing a direct binding request.
+
+PASIDs may be managed per-device or system-wide, which, again, depends on
+the IOMMU vendor and is reported in the nesting cap info. When a
+system-wide policy is reported (IOMMU_NESTING_FEAT_SYSWIDE_PASID), e.g. on
+Intel platforms, userspace *must* allocate PASIDs from VFIO before
+attempting to bind a stage-1 page table:
+
+req.flags = VFIO_IOMMU_ALLOC_PASID;
+ioctl(container, VFIO_IOMMU_PASID_REQUEST, &req);
+
+Once the stage-1 page table is bound to the IOMMU, the guest is free to
+fully manage its mappings. The IOMMU walks the nested stage-1 and stage-2
+page tables when serving DMA requests from the assigned device, and may
+cache the stage-1 mappings in the IOTLB. When required
+(IOMMU_NESTING_FEAT_CACHE_INVLD), userspace *must* forward guest stage-1
+invalidations to the host, so the IOTLB is invalidated:
+
+nesting_op->flags = VFIO_IOMMU_NESTING_OP_CACHE_INVLD;
+memcpy(&nesting_op->data, &cache_inv_data, sizeof(cache_inv_data));
+ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
+
+Forwarded invalidations can happen at various granularity levels (page
+le

[PATCH v6 15/15] iommu/vt-d: Support reporting nesting capability info

2020-07-28 Thread Liu Yi L
This patch reports nesting info, and only supports the case where all
the physical IOMMUs have the same CAP/ECAP masks.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
v2 -> v3:
*) remove cap/ecap_mask in iommu_nesting_info.
---
 drivers/iommu/intel/iommu.c | 81 +++--
 include/linux/intel-iommu.h | 16 +
 2 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 88f4647..0835804 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5660,12 +5660,16 @@ static inline bool iommu_pasid_support(void)
 static inline bool nested_mode_support(void)
 {
struct dmar_drhd_unit *drhd;
-   struct intel_iommu *iommu;
+   struct intel_iommu *iommu, *prev = NULL;
bool ret = true;
 
rcu_read_lock();
for_each_active_iommu(iommu, drhd) {
-   if (!sm_supported(iommu) || !ecap_nest(iommu->ecap)) {
+   if (!prev)
+   prev = iommu;
+   if (!sm_supported(iommu) || !ecap_nest(iommu->ecap) ||
+   (VTD_CAP_MASK & (iommu->cap ^ prev->cap)) ||
+   (VTD_ECAP_MASK & (iommu->ecap ^ prev->ecap))) {
ret = false;
break;
}
@@ -6081,6 +6085,78 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
return ret;
 }
 
+static int intel_iommu_get_nesting_info(struct iommu_domain *domain,
+   struct iommu_nesting_info *info)
+{
+   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+   u64 cap = VTD_CAP_MASK, ecap = VTD_ECAP_MASK;
+   struct device_domain_info *domain_info;
+   struct iommu_nesting_info_vtd vtd;
+   unsigned long flags;
+   unsigned int size;
+
+   if (domain->type != IOMMU_DOMAIN_UNMANAGED ||
+   !(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE))
+   return -ENODEV;
+
+   if (!info)
+   return -EINVAL;
+
+   size = sizeof(struct iommu_nesting_info) +
+   sizeof(struct iommu_nesting_info_vtd);
+   /*
+* If the provided buffer size is smaller than expected, return
+* 0 and report the expected buffer size to the caller.
+*/
+   if (info->argsz < size) {
+   info->argsz = size;
+   return 0;
+   }
+
+   spin_lock_irqsave(&device_domain_lock, flags);
+   /*
+* Arbitrarily select the first domain_info, as all nesting
+* related capabilities should be consistent across iommu
+* units.
+*/
+   domain_info = list_first_entry(&dmar_domain->devices,
+  struct device_domain_info, link);
+   cap &= domain_info->iommu->cap;
+   ecap &= domain_info->iommu->ecap;
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+
+   info->format = IOMMU_PASID_FORMAT_INTEL_VTD;
+   info->features = IOMMU_NESTING_FEAT_SYSWIDE_PASID |
+IOMMU_NESTING_FEAT_BIND_PGTBL |
+IOMMU_NESTING_FEAT_CACHE_INVLD;
+   info->addr_width = dmar_domain->gaw;
+   info->pasid_bits = ilog2(intel_pasid_max_id);
+   info->padding = 0;
+   vtd.flags = 0;
+   vtd.padding = 0;
+   vtd.cap_reg = cap;
+   vtd.ecap_reg = ecap;
+
+   memcpy(info->data, &vtd, sizeof(vtd));
+   return 0;
+}
+
+static int intel_iommu_domain_get_attr(struct iommu_domain *domain,
+  enum iommu_attr attr, void *data)
+{
+   switch (attr) {
+   case DOMAIN_ATTR_NESTING:
+   {
+   struct iommu_nesting_info *info =
+   (struct iommu_nesting_info *)data;
+
+   return intel_iommu_get_nesting_info(domain, info);
+   }
+   default:
+   return -ENOENT;
+   }
+}
+
 /*
  * Check that the device does not live on an external facing PCI port that is
  * marked as untrusted. Such devices should not be able to apply quirks and
@@ -6103,6 +6179,7 @@ const struct iommu_ops intel_iommu_ops = {
.domain_alloc   = intel_iommu_domain_alloc,
.domain_free= intel_iommu_domain_free,
.domain_set_attr= intel_iommu_domain_set_attr,
+   .domain_get_attr= intel_iommu_domain_get_attr,
.attach_dev = intel_iommu_attach_device,
.detach_dev = intel_iommu_detach_device,
.aux_attach_dev = intel_iommu_aux_attach_device,
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index f98146b..5acf795 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -197,6 +197,22 @@
 #defi

[PATCH v6 07/15] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)

2020-07-28 Thread Liu Yi L
This patch allows userspace to request PASID allocation/free, e.g. when
serving the request from the guest.

PASIDs that are not freed by userspace are automatically freed when the
IOASID set is destroyed on process exit.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Yi Sun 
Signed-off-by: Jacob Pan 
---
v5 -> v6:
*) address comments from Eric against v5. remove the alloc/free helper.

v4 -> v5:
*) address comments from Eric Auger.
*) the comments for the PASID_FREE request are addressed in patch 5/15 of
   this series.

v3 -> v4:
*) address comments from v3, except the below comment against the range
   of PASID_FREE request. needs more help on it.
"> +if (req.range.min > req.range.max)

 Is it exploitable that a user can spin the kernel for a long time in
 the case of a free by calling this with [0, MAX_UINT] regardless of
 their actual allocations?"
https://lore.kernel.org/linux-iommu/20200702151832.048b4...@x1.home/

v1 -> v2:
*) move the vfio_mm related code to be a separate module
*) use a single structure for alloc/free, could support a range of PASIDs
*) fetch vfio_mm at group_attach time instead of at iommu driver open time
---
 drivers/vfio/Kconfig|  1 +
 drivers/vfio/vfio_iommu_type1.c | 69 +
 drivers/vfio/vfio_pasid.c   | 10 ++
 include/linux/vfio.h|  6 
 include/uapi/linux/vfio.h   | 37 ++
 5 files changed, 123 insertions(+)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 3d8a108..95d90c6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -2,6 +2,7 @@
 config VFIO_IOMMU_TYPE1
tristate
depends on VFIO
+   select VFIO_PASID if (X86)
default n
 
 config VFIO_IOMMU_SPAPR_TCE
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 18ff0c3..ea89c7c 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -76,6 +76,7 @@ struct vfio_iommu {
booldirty_page_tracking;
boolpinned_page_dirty_scope;
struct iommu_nesting_info   *nesting_info;
+   struct vfio_mm  *vmm;
 };
 
 struct vfio_domain {
@@ -1937,6 +1938,11 @@ static void vfio_iommu_iova_insert_copy(struct 
vfio_iommu *iommu,
 
 static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
 {
+   if (iommu->vmm) {
+   vfio_mm_put(iommu->vmm);
+   iommu->vmm = NULL;
+   }
+
kfree(iommu->nesting_info);
iommu->nesting_info = NULL;
 }
@@ -2071,6 +2077,26 @@ static int vfio_iommu_type1_attach_group(void 
*iommu_data,
iommu->nesting_info);
if (ret)
goto out_detach;
+
+   if (iommu->nesting_info->features &
+   IOMMU_NESTING_FEAT_SYSWIDE_PASID) {
+   struct vfio_mm *vmm;
+   int sid;
+
+   vmm = vfio_mm_get_from_task(current);
+   if (IS_ERR(vmm)) {
+   ret = PTR_ERR(vmm);
+   goto out_detach;
+   }
+   iommu->vmm = vmm;
+
+   sid = vfio_mm_ioasid_sid(vmm);
+   ret = iommu_domain_set_attr(domain->domain,
+   DOMAIN_ATTR_IOASID_SID,
+   &sid);
+   if (ret)
+   goto out_detach;
+   }
}
 
/* Get aperture info */
@@ -2859,6 +2885,47 @@ static int vfio_iommu_type1_dirty_pages(struct 
vfio_iommu *iommu,
return -EINVAL;
 }
 
+static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu,
+ unsigned long arg)
+{
+   struct vfio_iommu_type1_pasid_request req;
+   unsigned long minsz;
+   int ret;
+
+   minsz = offsetofend(struct vfio_iommu_type1_pasid_request, range);
+
+   if (copy_from_user(&req, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (req.argsz < minsz || (req.flags & ~VFIO_PASID_REQUEST_MASK))
+   return -EINVAL;
+
+   if (req.range.min > req.range.max)
+   return -EINVAL;
+
+   mutex_lock(&iommu->lock);
+   if (!iommu->vmm) {
+   mutex_unlock(&iommu->lock);
+   return -EOPNOTSUPP;
+   }
+
+   switch (req.flags & VFIO_PASID_REQUEST_MASK) {
+   case VFIO_IOMMU_FLAG_ALLOC_PASID:
+   ret = vfio_pasid_alloc(iommu->vmm, req.range.min,
+

[PATCH v6 02/15] iommu: Report domain nesting info

2020-07-28 Thread Liu Yi L
IOMMUs that support nested translation need to report the capability info
to userspace. It gives information about the requirements userspace needs
to implement, plus other features characterizing the physical implementation.

This patch reports nesting info via DOMAIN_ATTR_NESTING. The caller can get
nesting info after setting DOMAIN_ATTR_NESTING. For VFIO, this is after
selecting VFIO_TYPE1_NESTING_IOMMU.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
v5 -> v6:
*) rephrase the feature notes per comments from Eric Auger.
*) rename @size of struct iommu_nesting_info to @argsz.

v4 -> v5:
*) address comments from Eric Auger.

v3 -> v4:
*) split the SMMU driver changes to be a separate patch
*) move the @addr_width and @pasid_bits from vendor specific
   part to generic part.
*) tweak the description for the @features field of struct
   iommu_nesting_info.
*) add description on the @data[] field of struct iommu_nesting_info

v2 -> v3:
*) remove cap/ecap_mask in iommu_nesting_info.
*) reuse DOMAIN_ATTR_NESTING to get nesting info.
*) return an empty iommu_nesting_info for SMMU drivers per Jean's
   suggestion.
---
 include/uapi/linux/iommu.h | 74 ++
 1 file changed, 74 insertions(+)

diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 7c8e075..5e4745a 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -332,4 +332,78 @@ struct iommu_gpasid_bind_data {
} vendor;
 };
 
+/*
+ * struct iommu_nesting_info - Information for nesting-capable IOMMU.
+ *userspace should check it before using
+ *nesting capability.
+ *
+ * @argsz: size of the whole structure.
+ * @flags: currently reserved for future extension. must be set to 0.
+ * @format:PASID table entry format, the same definition as struct
+ * iommu_gpasid_bind_data @format.
+ * @features:  supported nesting features.
+ * @addr_width:The output addr width of first level/stage translation
+ * @pasid_bits:Maximum supported PASID bits, 0 represents no PASID
+ * support.
+ * @data:  vendor specific cap info. data[] structure type can be deduced
+ * from @format field.
+ *
+ * +===+==+
+ * | feature   |  Notes   |
+ * +===+==+
+ * | SYSWIDE_PASID |  IOMMU vendor driver sets it to mandate userspace|
+ * |   |  to allocate PASID from kernel. All PASID allocation |
+ * |   |  and free must be mediated through the TBD API.  |
+ * +---+--+
+ * | BIND_PGTBL|  IOMMU vendor driver sets it to mandate userspace|
+ * |   |  bind the first level/stage page table to associated |
+ * |   |  PASID (either the one specified in bind request or  |
+ * |   |  the default PASID of iommu domain), through IOMMU   |
+ * |   |  UAPI.   |
+ * +---+--+
+ * | CACHE_INVLD   |  IOMMU vendor driver sets it to mandate userspace|
+ * |   |  explicitly invalidate the IOMMU cache through IOMMU |
+ * |   |  UAPI according to vendor-specific requirement when  |
+ * |   |  changing the 1st level/stage page table.|
+ * +---+--+
+ *
+ * @data[] types defined for @format:
+ * ++=+
+ * | @format| @data[] |
+ * ++=+
+ * | IOMMU_PASID_FORMAT_INTEL_VTD   | struct iommu_nesting_info_vtd   |
+ * ++-+
+ *
+ */
+struct iommu_nesting_info {
+   __u32   argsz;
+   __u32   flags;
+   __u32   format;
+#define IOMMU_NESTING_FEAT_SYSWIDE_PASID   (1 << 0)
+#define IOMMU_NESTING_FEAT_BIND_PGTBL  (1 << 1)
+#define IOMMU_NESTING_FEAT_CACHE_INVLD (1 << 2)
+   __u32   features;
+   __u16   addr_width;
+   __u16   pasid_bits;
+   __u32   padding;
+   __u8data[];
+};
+
+/*
+ * struct iommu_nesting_info_vtd - Intel VT-d specific nesting info.
+ *
+ * @flags: VT-d specific flags. Currently reserved for future
+ * extension. must be set to 0.
+ * @cap_reg:   Describe basic capabilities as defined in VT-d capability
+ * register.
+ * @ecap_reg:  Describe the extended c

[PATCH v6 10/15] vfio/type1: Support binding guest page tables to PASID

2020-07-28 Thread Liu Yi L
Nesting translation allows two-levels/stages page tables, with 1st level
for guest translations (e.g. GVA->GPA), 2nd level for host translations
(e.g. GPA->HPA). This patch adds interface for binding guest page tables
to a PASID. This PASID must have been allocated by the userspace before
the binding request.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Jean-Philippe Brucker 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
v5 -> v6:
*) dropped vfio_find_nesting_group() and added vfio_get_nesting_domain_capsule().
   per comment from Eric.
*) use iommu_uapi_sva_bind/unbind_gpasid() and iommu_sva_unbind_gpasid() in
   linux/iommu.h for userspace operation and in-kernel operation.

v3 -> v4:
*) address comments from Alex on v3

v2 -> v3:
*) use __iommu_sva_unbind_gpasid() for unbind call issued by VFIO
   
https://lore.kernel.org/linux-iommu/1592931837-58223-6-git-send-email-jacob.jun@linux.intel.com/

v1 -> v2:
*) rename subject from "vfio/type1: Bind guest page tables to host"
*) remove VFIO_IOMMU_BIND, introduce VFIO_IOMMU_NESTING_OP to support bind/
   unbind guest page table
*) replaced vfio_iommu_for_each_dev() with a group level loop since this
   series enforces one group per container w/ nesting type as start.
*) rename vfio_bind/unbind_gpasid_fn() to vfio_dev_bind/unbind_gpasid_fn()
*) vfio_dev_unbind_gpasid() always successful
*) use vfio_mm->pasid_lock to avoid race between PASID free and page table
   bind/unbind
---
 drivers/vfio/vfio_iommu_type1.c | 161 
 drivers/vfio/vfio_pasid.c   |  26 +++
 include/linux/vfio.h|  20 +
 include/uapi/linux/vfio.h   |  31 
 4 files changed, 238 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ea89c7c..245436e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -149,6 +149,36 @@ struct vfio_regions {
 #define DIRTY_BITMAP_PAGES_MAX  ((u64)INT_MAX)
 #define DIRTY_BITMAP_SIZE_MAX   DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
 
+struct domain_capsule {
+   struct vfio_group *group;
+   struct iommu_domain *domain;
+   void *data;
+};
+
+/* iommu->lock must be held */
+static int vfio_get_nesting_domain_capsule(struct vfio_iommu *iommu,
+  struct domain_capsule *dc)
+{
+   struct vfio_domain *domain = NULL;
+   struct vfio_group *group = NULL;
+
+   if (!iommu->nesting_info)
+   return -EINVAL;
+
+   /*
+* Only support a singleton container with nesting type.
+* If nesting_info is non-NULL, the container should
+* be non-empty, and the domain should also be non-empty.
+*/
+   domain = list_first_entry(&iommu->domain_list,
+ struct vfio_domain, next);
+   group = list_first_entry(&iommu->group_list,
+struct vfio_group, next);
+   dc->group = group;
+   dc->domain = domain->domain;
+   return 0;
+}
+
 static int put_pfn(unsigned long pfn, int prot);
 
 static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
@@ -2349,6 +2379,48 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu 
*iommu,
return ret;
 }
 
+static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
+{
+   struct domain_capsule *dc = (struct domain_capsule *)data;
+   unsigned long arg = *(unsigned long *)dc->data;
+
+   return iommu_uapi_sva_bind_gpasid(dc->domain, dev, (void __user *)arg);
+}
+
+static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
+{
+   struct domain_capsule *dc = (struct domain_capsule *)data;
+   unsigned long arg = *(unsigned long *)dc->data;
+
+   iommu_uapi_sva_unbind_gpasid(dc->domain, dev, (void __user *)arg);
+   return 0;
+}
+
+static int __vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
+{
+   struct domain_capsule *dc = (struct domain_capsule *)data;
+   struct iommu_gpasid_bind_data *unbind_data =
+   (struct iommu_gpasid_bind_data *)dc->data;
+
+   iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data);
+   return 0;
+}
+
+static void vfio_group_unbind_gpasid_fn(ioasid_t pasid, void *data)
+{
+   struct domain_capsule *dc = (struct domain_capsule *)data;
+   struct iommu_gpasid_bind_data unbind_data;
+
+   unbind_data.argsz = offsetof(struct iommu_gpasid_bind_data, vendor);
+   unbind_data.flags = 0;
+   unbind_data.hpasid = pasid;
+
+   dc->data = &unbind_data;
+
+   iommu_group_for_each_dev(dc->group->iommu_group,
+dc, __vfio_dev_unbind_gpasid_fn);
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 

[PATCH v6 12/15] vfio/type1: Add vSVA support for IOMMU-backed mdevs

2020-07-28 Thread Liu Yi L
In recent years, the mediated device pass-through framework (e.g.
vfio-mdev) has been used to achieve flexible device sharing across
domains (e.g. VMs). There are also hardware-assisted mediated
pass-through solutions from platform vendors, e.g. Intel VT-d scalable
mode, which supports the Intel Scalable I/O Virtualization technology.
Such mdevs are called IOMMU-backed mdevs, as there is IOMMU-enforced DMA
isolation for them. In the kernel, IOMMU-backed mdevs are exposed to the
IOMMU layer via the aux-domain concept, which means mdevs are protected
by an iommu domain which is auxiliary to the domain that the kernel
driver primarily uses for the DMA API. Details can be found in the KVM
Forum presentation below:

https://events19.linuxfoundation.org/wp-content/uploads/2017/12/\
Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf

This patch extends NESTING_IOMMU ops to IOMMU-backed mdev devices. The
main requirement is to use the auxiliary domain associated with mdev.

Cc: Kevin Tian 
CC: Jacob Pan 
CC: Jun Tian 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Reviewed-by: Eric Auger 
Signed-off-by: Liu Yi L 
---
v5 -> v6:
*) add review-by from Eric Auger.

v1 -> v2:
*) check the iommu_device to ensure the handling mdev is IOMMU-backed
---
 drivers/vfio/vfio_iommu_type1.c | 40 
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index bf95a0f..9d8f252 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2379,20 +2379,41 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu 
*iommu,
return ret;
 }
 
+static struct device *vfio_get_iommu_device(struct vfio_group *group,
+   struct device *dev)
+{
+   if (group->mdev_group)
+   return vfio_mdev_get_iommu_device(dev);
+   else
+   return dev;
+}
+
 static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
 {
struct domain_capsule *dc = (struct domain_capsule *)data;
unsigned long arg = *(unsigned long *)dc->data;
+   struct device *iommu_device;
+
+   iommu_device = vfio_get_iommu_device(dc->group, dev);
+   if (!iommu_device)
+   return -EINVAL;
 
-   return iommu_uapi_sva_bind_gpasid(dc->domain, dev, (void __user *)arg);
+   return iommu_uapi_sva_bind_gpasid(dc->domain, iommu_device,
+ (void __user *)arg);
 }
 
 static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
 {
struct domain_capsule *dc = (struct domain_capsule *)data;
unsigned long arg = *(unsigned long *)dc->data;
+   struct device *iommu_device;
 
-   iommu_uapi_sva_unbind_gpasid(dc->domain, dev, (void __user *)arg);
+   iommu_device = vfio_get_iommu_device(dc->group, dev);
+   if (!iommu_device)
+   return -EINVAL;
+
+   iommu_uapi_sva_unbind_gpasid(dc->domain, iommu_device,
+(void __user *)arg);
return 0;
 }
 
@@ -2401,8 +2422,13 @@ static int __vfio_dev_unbind_gpasid_fn(struct device 
*dev, void *data)
struct domain_capsule *dc = (struct domain_capsule *)data;
struct iommu_gpasid_bind_data *unbind_data =
(struct iommu_gpasid_bind_data *)dc->data;
+   struct device *iommu_device;
+
+   iommu_device = vfio_get_iommu_device(dc->group, dev);
+   if (!iommu_device)
+   return -EINVAL;
 
-   iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data);
+   iommu_sva_unbind_gpasid(dc->domain, iommu_device, unbind_data);
return 0;
 }
 
@@ -3060,8 +3086,14 @@ static int vfio_dev_cache_invalidate_fn(struct device 
*dev, void *data)
 {
struct domain_capsule *dc = (struct domain_capsule *)data;
unsigned long arg = *(unsigned long *)dc->data;
+   struct device *iommu_device;
+
+   iommu_device = vfio_get_iommu_device(dc->group, dev);
+   if (!iommu_device)
+   return -EINVAL;
 
-   iommu_uapi_cache_invalidate(dc->domain, dev, (void __user *)arg);
+   iommu_uapi_cache_invalidate(dc->domain, iommu_device,
+   (void __user *)arg);
return 0;
 }
 
-- 
2.7.4



[PATCH v6 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-07-28 Thread Liu Yi L
 related code from vfio.c to be a separate
 vfio_pasid.ko.
  f) Add PASID ownership check in IOMMU driver.
  g) Adopted to latest IOMMU UAPI design. Removed IOMMU UAPI version
 check. Added iommu_gpasid_unbind_data for unbind requests from
 userspace.
  h) Define a single ioctl:VFIO_IOMMU_NESTING_OP for bind/unbind_gtbl
     and cache_invld.
  i) Document dual stage control in vfio.rst.
  Patch v1: 
https://lore.kernel.org/linux-iommu/1584880325-10561-1-git-send-email-yi.l@intel.com/

- RFC v3 -> Patch v1:
  a) Address comments to the PASID request(alloc/free) path
  b) Report PASID alloc/free availability to user-space
  c) Add a vfio_iommu_type1 parameter to support pasid quota tuning
  d) Adjusted to latest ioasid code implementation. e.g. remove the
 code for tracking the allocated PASIDs as latest ioasid code
 will track it, VFIO could use ioasid_free_set() to free all
 PASIDs.
  RFC v3: 
https://lore.kernel.org/linux-iommu/1580299912-86084-1-git-send-email-yi.l@intel.com/

- RFC v2 -> v3:
  a) Refine the whole patchset to fit roughly into the parts in this series
  b) Adds complete vfio PASID management framework. e.g. pasid alloc,
  free, reclaim in VM crash/down and per-VM PASID quota to prevent
  PASID abuse.
  c) Adds IOMMU uAPI version check and page table format check to ensure
  version compatibility and hardware compatibility.
  d) Adds vSVA vfio support for IOMMU-backed mdevs.
  RFC v2: 
https://lore.kernel.org/linux-iommu/1571919983-3231-1-git-send-email-yi.l@intel.com/

- RFC v1 -> v2:
  Dropped vfio: VFIO_IOMMU_ATTACH/DETACH_PASID_TABLE.
  RFC v1: 
https://lore.kernel.org/linux-iommu/1562324772-3084-1-git-send-email-yi.l@intel.com/

---
Eric Auger (1):
  vfio: Document dual stage control

Liu Yi L (13):
  vfio/type1: Refactor vfio_iommu_type1_ioctl()
  iommu: Report domain nesting info
  iommu/smmu: Report empty domain nesting info
  vfio/type1: Report iommu nesting info to userspace
  vfio: Add PASID allocation/free support
  iommu/vt-d: Support setting ioasid set to domain
  vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)
  iommu/vt-d: Check ownership for PASIDs from user-space
  vfio/type1: Support binding guest page tables to PASID
  vfio/type1: Allow invalidating first-level/stage IOMMU cache
  vfio/type1: Add vSVA support for IOMMU-backed mdevs
  vfio/pci: Expose PCIe PASID capability to guest
  iommu/vt-d: Support reporting nesting capability info

Yi Sun (1):
  iommu: Pass domain to sva_unbind_gpasid()

 Documentation/driver-api/vfio.rst  |  75 
 drivers/iommu/arm-smmu-v3.c|  29 +-
 drivers/iommu/arm-smmu.c   |  29 +-
 drivers/iommu/intel/iommu.c| 114 +-
 drivers/iommu/intel/svm.c  |  10 +-
 drivers/iommu/iommu.c  |   2 +-
 drivers/vfio/Kconfig   |   6 +
 drivers/vfio/Makefile  |   1 +
 drivers/vfio/pci/vfio_pci_config.c |   2 +-
 drivers/vfio/vfio_iommu_type1.c| 796 -
 drivers/vfio/vfio_pasid.c  | 284 +
 include/linux/intel-iommu.h|  23 +-
 include/linux/iommu.h  |   4 +-
 include/linux/vfio.h   |  54 +++
 include/uapi/linux/iommu.h |  74 
 include/uapi/linux/vfio.h  |  90 +
 16 files changed, 1391 insertions(+), 202 deletions(-)
 create mode 100644 drivers/vfio/vfio_pasid.c

-- 
2.7.4



[PATCH v6 11/15] vfio/type1: Allow invalidating first-level/stage IOMMU cache

2020-07-28 Thread Liu Yi L
This patch provides an interface allowing userspace to invalidate the
IOMMU cache for the first-level page table. It is required in the nested
translation setup, where the first-level IOMMU page table is not managed
by the host kernel.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Eric Auger 
Signed-off-by: Jacob Pan 
---
v1 -> v2:
*) rename from "vfio/type1: Flush stage-1 IOMMU cache for nesting type"
*) rename vfio_cache_inv_fn() to vfio_dev_cache_invalidate_fn()
*) vfio_dev_cache_inv_fn() always successful
*) remove VFIO_IOMMU_CACHE_INVALIDATE, and reuse VFIO_IOMMU_NESTING_OP
---
 drivers/vfio/vfio_iommu_type1.c | 42 +
 include/uapi/linux/vfio.h   |  3 +++
 2 files changed, 45 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 245436e..bf95a0f 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3056,6 +3056,45 @@ static long vfio_iommu_handle_pgtbl_op(struct vfio_iommu 
*iommu,
return ret;
 }
 
+static int vfio_dev_cache_invalidate_fn(struct device *dev, void *data)
+{
+   struct domain_capsule *dc = (struct domain_capsule *)data;
+   unsigned long arg = *(unsigned long *)dc->data;
+
+   iommu_uapi_cache_invalidate(dc->domain, dev, (void __user *)arg);
+   return 0;
+}
+
+static long vfio_iommu_invalidate_cache(struct vfio_iommu *iommu,
+   unsigned long arg)
+{
+   struct domain_capsule dc = { .data = &arg };
+   struct iommu_nesting_info *info;
+   int ret;
+
+   mutex_lock(&iommu->lock);
+   /*
+* Cache invalidation is required for any nesting IOMMU,
+* so no need to check system-wide PASID support.
+*/
+   info = iommu->nesting_info;
+   if (!info || !(info->features & IOMMU_NESTING_FEAT_CACHE_INVLD)) {
+   ret = -EOPNOTSUPP;
+   goto out_unlock;
+   }
+
+   ret = vfio_get_nesting_domain_capsule(iommu, &dc);
+   if (ret)
+   goto out_unlock;
+
+   iommu_group_for_each_dev(dc.group->iommu_group, &dc,
+vfio_dev_cache_invalidate_fn);
+
+out_unlock:
+   mutex_unlock(&iommu->lock);
+   return ret;
+}
+
 static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu,
unsigned long arg)
 {
@@ -3078,6 +3117,9 @@ static long vfio_iommu_type1_nesting_op(struct vfio_iommu 
*iommu,
case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL:
ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz);
break;
+   case VFIO_IOMMU_NESTING_OP_CACHE_INVLD:
+   ret = vfio_iommu_invalidate_cache(iommu, arg + minsz);
+   break;
default:
ret = -EINVAL;
}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9501cfb..48e2fb5 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1225,6 +1225,8 @@ struct vfio_iommu_type1_pasid_request {
  * +-+---+
  * | UNBIND_PGTBL|  struct iommu_gpasid_bind_data|
  * +-+---+
+ * | CACHE_INVLD |  struct iommu_cache_invalidate_info   |
+ * +-+---+
  *
  * returns: 0 on success, -errno on failure.
  */
@@ -1237,6 +1239,7 @@ struct vfio_iommu_type1_nesting_op {
 
 #define VFIO_IOMMU_NESTING_OP_BIND_PGTBL   (0)
 #define VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL (1)
+#define VFIO_IOMMU_NESTING_OP_CACHE_INVLD  (2)
 
 #define VFIO_IOMMU_NESTING_OP  _IO(VFIO_TYPE, VFIO_BASE + 19)
 
-- 
2.7.4



[PATCH v6 08/15] iommu: Pass domain to sva_unbind_gpasid()

2020-07-28 Thread Liu Yi L
From: Yi Sun 

The current interface is good enough for SVA virtualization on an
assigned physical PCI device, but when it comes to mediated devices, a
physical device may be attached with multiple aux-domains. Also, for
guest unbind, the PASID to be unbound should have been allocated to the
VM. This check requires knowing the ioasid_set which is associated with
the domain.

So this interface needs to pass in the domain info. The iommu driver is
then able to know which domain will be used for the 2nd stage
translation of the nesting mode, and is also able to do the PASID
ownership check. This patch passes @domain for the above reasons. Also,
the prototype of @pasid is changed from "int" to "u32" as in the below
link.

https://lore.kernel.org/kvm/27ac7880-bdd3-2891-139e-b4a7cd184...@redhat.com/

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Reviewed-by: Eric Auger 
Signed-off-by: Yi Sun 
Signed-off-by: Liu Yi L 
---
v5 -> v6:
*) use "u32" prototype for @pasid.
*) add review-by from Eric Auger.

v2 -> v3:
*) pass in domain info only
*) use u32 for pasid instead of int type

v1 -> v2:
*) added in v2.
---
 drivers/iommu/intel/svm.c   | 3 ++-
 drivers/iommu/iommu.c   | 2 +-
 include/linux/intel-iommu.h | 3 ++-
 include/linux/iommu.h   | 3 ++-
 4 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index c27d16a..c85b8d5 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -436,7 +436,8 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, 
struct device *dev,
return ret;
 }
 
-int intel_svm_unbind_gpasid(struct device *dev, int pasid)
+int intel_svm_unbind_gpasid(struct iommu_domain *domain,
+   struct device *dev, u32 pasid)
 {
struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
struct intel_svm_dev *sdev;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1ce2a61..bee79d7 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2145,7 +2145,7 @@ int iommu_sva_unbind_gpasid(struct iommu_domain *domain, 
struct device *dev,
if (unlikely(!domain->ops->sva_unbind_gpasid))
return -ENODEV;
 
-   return domain->ops->sva_unbind_gpasid(dev, data->hpasid);
+   return domain->ops->sva_unbind_gpasid(domain, dev, data->hpasid);
 }
 EXPORT_SYMBOL_GPL(iommu_sva_unbind_gpasid);
 
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 0d0ab32..f98146b 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -738,7 +738,8 @@ extern int intel_svm_enable_prq(struct intel_iommu *iommu);
 extern int intel_svm_finish_prq(struct intel_iommu *iommu);
 int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
  struct iommu_gpasid_bind_data *data);
-int intel_svm_unbind_gpasid(struct device *dev, int pasid);
+int intel_svm_unbind_gpasid(struct iommu_domain *domain,
+   struct device *dev, u32 pasid);
 struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_struct *mm,
 void *drvdata);
 void intel_svm_unbind(struct iommu_sva *handle);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b1ff702..80467fc 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -303,7 +303,8 @@ struct iommu_ops {
int (*sva_bind_gpasid)(struct iommu_domain *domain,
struct device *dev, struct iommu_gpasid_bind_data 
*data);
 
-   int (*sva_unbind_gpasid)(struct device *dev, int pasid);
+   int (*sva_unbind_gpasid)(struct iommu_domain *domain,
+struct device *dev, u32 pasid);
 
int (*def_domain_type)(struct device *dev);
 
-- 
2.7.4



[PATCH v6 01/15] vfio/type1: Refactor vfio_iommu_type1_ioctl()

2020-07-28 Thread Liu Yi L
This patch refactors vfio_iommu_type1_ioctl() to use a switch instead of
if-else chains, and gives each command its own helper function.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Reviewed-by: Eric Auger 
Suggested-by: Christoph Hellwig 
Signed-off-by: Liu Yi L 
---
v4 -> v5:
*) address comments from Eric Auger, add r-b from Eric.
---
 drivers/vfio/vfio_iommu_type1.c | 394 ++--
 1 file changed, 213 insertions(+), 181 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5e556ac..3bd70ff 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2453,6 +2453,23 @@ static int vfio_domains_have_iommu_cache(struct 
vfio_iommu *iommu)
return ret;
 }
 
+static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
+   unsigned long arg)
+{
+   switch (arg) {
+   case VFIO_TYPE1_IOMMU:
+   case VFIO_TYPE1v2_IOMMU:
+   case VFIO_TYPE1_NESTING_IOMMU:
+   return 1;
+   case VFIO_DMA_CC_IOMMU:
+   if (!iommu)
+   return 0;
+   return vfio_domains_have_iommu_cache(iommu);
+   default:
+   return 0;
+   }
+}
+
 static int vfio_iommu_iova_add_cap(struct vfio_info_cap *caps,
 struct vfio_iommu_type1_info_cap_iova_range *cap_iovas,
 size_t size)
@@ -2529,241 +2546,256 @@ static int vfio_iommu_migration_build_caps(struct 
vfio_iommu *iommu,
return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig));
 }
 
-static long vfio_iommu_type1_ioctl(void *iommu_data,
-  unsigned int cmd, unsigned long arg)
+static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
+unsigned long arg)
 {
-   struct vfio_iommu *iommu = iommu_data;
+   struct vfio_iommu_type1_info info;
unsigned long minsz;
+   struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+   unsigned long capsz;
+   int ret;
 
-   if (cmd == VFIO_CHECK_EXTENSION) {
-   switch (arg) {
-   case VFIO_TYPE1_IOMMU:
-   case VFIO_TYPE1v2_IOMMU:
-   case VFIO_TYPE1_NESTING_IOMMU:
-   return 1;
-   case VFIO_DMA_CC_IOMMU:
-   if (!iommu)
-   return 0;
-   return vfio_domains_have_iommu_cache(iommu);
-   default:
-   return 0;
-   }
-   } else if (cmd == VFIO_IOMMU_GET_INFO) {
-   struct vfio_iommu_type1_info info;
-   struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
-   unsigned long capsz;
-   int ret;
-
-   minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+   minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
 
-   /* For backward compatibility, cannot require this */
-   capsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);
+   /* For backward compatibility, cannot require this */
+   capsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);
 
-   if (copy_from_user(&info, (void __user *)arg, minsz))
-   return -EFAULT;
+   if (copy_from_user(&info, (void __user *)arg, minsz))
+   return -EFAULT;
 
-   if (info.argsz < minsz)
-   return -EINVAL;
+   if (info.argsz < minsz)
+   return -EINVAL;
 
-   if (info.argsz >= capsz) {
-   minsz = capsz;
-   info.cap_offset = 0; /* output, no-recopy necessary */
-   }
+   if (info.argsz >= capsz) {
+   minsz = capsz;
+   info.cap_offset = 0; /* output, no-recopy necessary */
+   }
 
-   mutex_lock(&iommu->lock);
-   info.flags = VFIO_IOMMU_INFO_PGSIZES;
+   mutex_lock(&iommu->lock);
+   info.flags = VFIO_IOMMU_INFO_PGSIZES;
 
-   info.iova_pgsizes = iommu->pgsize_bitmap;
+   info.iova_pgsizes = iommu->pgsize_bitmap;
 
-   ret = vfio_iommu_migration_build_caps(iommu, &caps);
+   ret = vfio_iommu_migration_build_caps(iommu, &caps);
 
-   if (!ret)
-   ret = vfio_iommu_iova_build_caps(iommu, &caps);
+   if (!ret)
+   ret = vfio_iommu_iova_build_caps(iommu, &caps);
 
-   mutex_unlock(&iommu->lock);
+   mutex_unlock(&iommu->lock);
 
-   if (ret)
-   return ret;
+   if (ret)
+   return ret;
 
-   if (caps.size) {
-   info.flags |= VFIO_IOMMU_INFO_CAPS;
+   if (caps.size) {
+   info.flags |= VFIO_IOMMU_INFO_CAPS;
 
-   if (info

RE: [PATCH v5 14/15] vfio: Document dual stage control

2020-07-25 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Saturday, July 18, 2020 9:40 PM
> 
> Hi Yi,
> 
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > From: Eric Auger 
> >
> > The VFIO API was enhanced to support nested stage control: a bunch of
> > new iotcls and usage guideline.
> ioctls

got it.

> >
> > Let's document the process to follow to set up nested mode.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Reviewed-by: Stefan Hajnoczi 
> > Signed-off-by: Eric Auger 
> > Signed-off-by: Liu Yi L 
> > ---
> > v3 -> v4:
> > *) add review-by from Stefan Hajnoczi
> >
> > v2 -> v3:
> > *) address comments from Stefan Hajnoczi
> >
> > v1 -> v2:
> > *) new in v2, compared with Eric's original version, pasid table bind
> >and fault reporting is removed as this series doesn't cover them.
> >Original version from Eric.
> >https://lkml.org/lkml/2020/3/20/700
> > ---
> >  Documentation/driver-api/vfio.rst | 67
> +++
> >  1 file changed, 67 insertions(+)
> >
> > diff --git a/Documentation/driver-api/vfio.rst 
> > b/Documentation/driver-api/vfio.rst
> > index f1a4d3c..0672c45 100644
> > --- a/Documentation/driver-api/vfio.rst
> > +++ b/Documentation/driver-api/vfio.rst
> > @@ -239,6 +239,73 @@ group and can access them as follows::
> > /* Gratuitous device reset and go... */
> > ioctl(device, VFIO_DEVICE_RESET);
> >
> > +IOMMU Dual Stage Control
> > +
> > +
> > +Some IOMMUs support 2 stages/levels of translation. Stage corresponds to
> > +the ARM terminology while level corresponds to Intel's VTD terminology.
> > +In the following text we use either without distinction.
> > +
> > +This is useful when the guest is exposed with a virtual IOMMU and some
> > +devices are assigned to the guest through VFIO. Then the guest OS can use
> > +stage 1 (GIOVA -> GPA or GVA->GPA), while the hypervisor uses stage 2 for
> > +VM isolation (GPA -> HPA).
> > +
> > +Under dual stage translation, the guest gets ownership of the stage 1 page
> > +tables and also owns stage 1 configuration structures. The hypervisor owns
> > +the root configuration structure (for security reason), including stage 2
> > +configuration. This works as long as configuration structures and page 
> > table
> > +formats are compatible between the virtual IOMMU and the physical IOMMU.
> > +
> > +Assuming the HW supports it, this nested mode is selected by choosing the
> > +VFIO_TYPE1_NESTING_IOMMU type through:
> > +
> > +ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
> > +
> > +This forces the hypervisor to use the stage 2, leaving stage 1 available
> > +for guest usage. The guest stage 1 format depends on IOMMU vendor, and
> > +it is the same with the nesting configuration method. User space should
> > +check the format and configuration method after setting nesting type by
> > +using:
> > +
> > +ioctl(container->fd, VFIO_IOMMU_GET_INFO, &nesting_info);
> > +
> > +Details can be found in Documentation/userspace-api/iommu.rst. For Intel
> > +VT-d, each stage 1 page table is bound to host by:
> > +
> > +nesting_op->flags = VFIO_IOMMU_NESTING_OP_BIND_PGTBL;
> > +memcpy(&nesting_op->data, &bind_data, sizeof(bind_data));
> > +ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
> > +
> > +As mentioned above, guest OS may use stage 1 for GIOVA->GPA or GVA->GPA.
> the guest OS, here and below?

got it.

> > +GVA->GPA page tables are available when PASID (Process Address Space ID)
> > +is exposed to guest. e.g. guest with PASID-capable devices assigned. For
> > +such page table binding, the bind_data should include PASID info, which
> > +is allocated by guest itself or by host. This depends on hardware vendor.
> > +e.g. Intel VT-d requires to allocate PASID from host. This requirement is
> 
> > +defined by the Virtual Command Support in VT-d 3.0 spec, guest software
> > +running on VT-d should allocate PASID from host kernel.
> because VTD 3.0 requires the unicity of the PASID, system wide, instead
> of the above repetition.

I see. perhaps better to say Intel platform. :-) will refine it.

> 
>  To allocate PASID
> > +from host, user space should check the IOMMU_NESTING_FEAT_SYSWIDE_PASID
> > +bit of the nesting info reported from host kernel. VFIO report

RE: [PATCH v5 12/15] vfio/type1: Add vSVA support for IOMMU-backed mdevs

2020-07-23 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric < eric.au...@redhat.com>
> Sent: Monday, July 20, 2020 8:22 PM
> 
> Yi,
> 
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > Recent years, mediated device pass-through framework (e.g. vfio-mdev)
> > is used to achieve flexible device sharing across domains (e.g. VMs).
> > Also there are hardware assisted mediated pass-through solutions from
> > platform vendors. e.g. Intel VT-d scalable mode which supports Intel
> > Scalable I/O Virtualization technology. Such mdevs are called IOMMU-
> > backed mdevs as there are IOMMU enforced DMA isolation for such mdevs.
> there is IOMMU enforced DMA isolation
> > In kernel, IOMMU-backed mdevs are exposed to IOMMU layer by aux-domain
> > concept, which means mdevs are protected by an iommu domain which is
> > auxiliary to the domain that the kernel driver primarily uses for DMA
> > API. Details can be found in the KVM presentation as below:
> >
> > https://events19.linuxfoundation.org/wp-content/uploads/2017/12/\
> > Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf
> >
> > This patch extends NESTING_IOMMU ops to IOMMU-backed mdev devices. The
> > main requirement is to use the auxiliary domain associated with mdev.
> 
> So as a result vSVM becomes functional for scalable mode mediated devices, 
> right?

yes, as long as the mediated device reports the PASID capability.

> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > CC: Jun Tian 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > ---
> > v1 -> v2:
> > *) check the iommu_device to ensure the handling mdev is IOMMU-backed
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 39
> > +++
> >  1 file changed, 35 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index 960cc59..f1f1ae2 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2373,20 +2373,41 @@ static int vfio_iommu_resv_refresh(struct
> vfio_iommu *iommu,
> > return ret;
> >  }
> >
> > +static struct device *vfio_get_iommu_device(struct vfio_group *group,
> > +   struct device *dev)
> > +{
> > +   if (group->mdev_group)
> > +   return vfio_mdev_get_iommu_device(dev);
> > +   else
> > +   return dev;
> > +}
> > +
> >  static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)  {
> > struct domain_capsule *dc = (struct domain_capsule *)data;
> > unsigned long arg = *(unsigned long *)dc->data;
> > +   struct device *iommu_device;
> > +
> > +   iommu_device = vfio_get_iommu_device(dc->group, dev);
> > +   if (!iommu_device)
> > +   return -EINVAL;
> >
> > -   return iommu_sva_bind_gpasid(dc->domain, dev, (void __user *)arg);
> > +   return iommu_sva_bind_gpasid(dc->domain, iommu_device,
> > +(void __user *)arg);
> >  }
> >
> >  static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
> > {
> > struct domain_capsule *dc = (struct domain_capsule *)data;
> > unsigned long arg = *(unsigned long *)dc->data;
> > +   struct device *iommu_device;
> >
> > -   iommu_sva_unbind_gpasid(dc->domain, dev, (void __user *)arg);
> > +   iommu_device = vfio_get_iommu_device(dc->group, dev);
> > +   if (!iommu_device)
> > +   return -EINVAL;
> > +
> > +   iommu_sva_unbind_gpasid(dc->domain, iommu_device,
> > +   (void __user *)arg);
> > return 0;
> >  }
> >
> > @@ -2395,8 +2416,13 @@ static int __vfio_dev_unbind_gpasid_fn(struct device
> *dev, void *data)
> > struct domain_capsule *dc = (struct domain_capsule *)data;
> > struct iommu_gpasid_bind_data *unbind_data =
> > (struct iommu_gpasid_bind_data *)dc->data;
> > +   struct device *iommu_device;
> > +
> > +   iommu_device = vfio_get_iommu_device(dc->group, dev);
> > +   if (!iommu_device)
> > +   return -EINVAL;
> >
> > -   __iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data);
> > +   __iommu_sva_unbind_gpasid(dc->domain, iommu_device, unbind_data);
> > return 0;
> >  }
> >
> > @@ -3077,8 +3103,13 @@ static int vfio_dev_cache_invalidate_fn(struct
> > device *dev, void *data)  {
> > struct domain_capsule *dc = (struct domain_capsule *)data;
> > unsigned long arg = *(unsigned long *)dc->data;
> > +   struct device *iommu_device;
> > +
> > +   iommu_device = vfio_get_iommu_device(dc->group, dev);
> > +   if (!iommu_device)
> > +   return -EINVAL;
> >
> > -   iommu_cache_invalidate(dc->domain, dev, (void __user *)arg);
> > +   iommu_cache_invalidate(dc->domain, iommu_device, (void __user
> > +*)arg);
> > return 0;
> >  }
> >
> >
> Besides,
> 
> Looks good to me
> 
> Reviewed-by: Eric Auger 

thanks :-)

Regards,
Yi Liu

> Eric



RE: [PATCH v5 03/15] iommu/smmu: Report empty domain nesting info

2020-07-23 Thread Liu, Yi L
Hi Jean,

> From: Jean-Philippe Brucker 
> Sent: Friday, July 17, 2020 5:09 PM
> 
> On Thu, Jul 16, 2020 at 10:38:17PM +0200, Auger Eric wrote:
> > Hi Jean,
> >
> > On 7/16/20 5:39 PM, Jean-Philippe Brucker wrote:
> > > On Tue, Jul 14, 2020 at 10:12:49AM +, Liu, Yi L wrote:
> > >>> Have you verified that this doesn't break the existing usage of
> > >>> DOMAIN_ATTR_NESTING in drivers/vfio/vfio_iommu_type1.c?
> > >>
> > >> I didn't have ARM machine on my hand. But I contacted with Jean
> > >> Philippe, he confirmed no compiling issue. I didn't see any code
> > >> getting DOMAIN_ATTR_NESTING attr in current
> drivers/vfio/vfio_iommu_type1.c.
> > >> What I'm adding is to call iommu_domain_get_attr(, DOMAIN_ATTR_NESTING)
> > >> and won't fail if iommu_domain_get_attr() returns 0. This patch
> > >> returns an empty nesting info for DOMAIN_ATTR_NESTING and the return
> > >> value is 0 if no error. So I guess it won't fail nesting for ARM.
> > >
> > > I confirm that this series doesn't break the current support for
> > > VFIO_IOMMU_TYPE1_NESTING with an SMMUv3. That said...
> > >
> > > If the SMMU does not support stage-2 then there is a change in behavior
> > > (untested): after the domain is silently switched to stage-1 by the SMMU
> > > driver, VFIO will now query nesting info and obtain -ENODEV. Instead of
> > > succeeding as before, the VFIO ioctl will now fail. I believe that's a fix
> > > rather than a regression, it should have been like this since the
> > > beginning. No known userspace has been using VFIO_IOMMU_TYPE1_NESTING
> so
> > > far, so I don't think it should be a concern.
> > But as Yi mentioned ealier, in the current vfio code there is no
> > DOMAIN_ATTR_NESTING query yet.
> 
> That's why something that would have succeeded before will now fail:
> Before this series, if user asked for a VFIO_IOMMU_TYPE1_NESTING, it would
> have succeeded even if the SMMU didn't support stage-2, as the driver
> would have silently fallen back on stage-1 mappings (which work exactly
> the same as stage-2-only since there was no nesting supported). After the
> series, we do check for DOMAIN_ATTR_NESTING so if user asks for
> VFIO_IOMMU_TYPE1_NESTING and the SMMU doesn't support stage-2, the ioctl
> fails.

I think this depends on the iommu driver. I noticed the current SMMU
driver doesn't check the physical IOMMU for nesting capability. So I
guess SET_IOMMU would still work for SMMU, but it will fail as you
mentioned in the prior email: userspace will check the nesting info and
fail as it gets an empty struct from the host.

https://lore.kernel.org/kvm/20200716153959.GA447208@myrica/

> 
> I believe it's a good fix and completely harmless, but wanted to make sure
> no one objects because it's an ABI change.

yes.

Regards,
Yi Liu

> Thanks,
> Jean
> 
> > In my SMMUV3 nested stage series, I added
> > such a query in vfio-pci.c to detect if I need to expose a fault region
> > but I already test both the returned value and the output arg. So to me
> > there is no issue with that change.
> > >
> > > And if userspace queries the nesting properties using the new ABI
> > > introduced in this patchset, it will obtain an empty struct. I think
> > > that's acceptable, but it may be better to avoid adding the nesting cap if
> > > @format is 0?
> > agreed
> >
> > Thanks
> >
> > Eric
> > >
> > > Thanks,
> > > Jean
> > >
> > >>
> > >> @Eric, how about your opinion? your dual-stage vSMMU support may
> > >> also share the vfio_iommu_type1.c code.
> > >>
> > >> Regards,
> > >> Yi Liu
> > >>
> > >>> Will
> > >
> >


RE: [PATCH v5 03/15] iommu/smmu: Report empty domain nesting info

2020-07-23 Thread Liu, Yi L
Hi Jean,

> From: Jean-Philippe Brucker 
> Sent: Thursday, July 16, 2020 11:40 PM
> 
> On Tue, Jul 14, 2020 at 10:12:49AM +0000, Liu, Yi L wrote:
> > > Have you verified that this doesn't break the existing usage of
> > > DOMAIN_ATTR_NESTING in drivers/vfio/vfio_iommu_type1.c?
> >
> > I didn't have ARM machine on my hand. But I contacted with Jean
> > Philippe, he confirmed no compiling issue. I didn't see any code
> > getting DOMAIN_ATTR_NESTING attr in current drivers/vfio/vfio_iommu_type1.c.
> > What I'm adding is to call iommu_domain_get_attr(, DOMAIN_ATTR_NESTING)
> > and won't fail if iommu_domain_get_attr() returns 0. This patch
> > returns an empty nesting info for DOMAIN_ATTR_NESTING and the return
> > value is 0 if no error. So I guess it won't fail nesting for ARM.
> 
> I confirm that this series doesn't break the current support for
> VFIO_IOMMU_TYPE1_NESTING with an SMMUv3. That said...

thanks.

> If the SMMU does not support stage-2 then there is a change in behavior
> (untested): after the domain is silently switched to stage-1 by the SMMU
> driver, VFIO will now query nesting info and obtain -ENODEV. Instead of
> succeeding as before, the VFIO ioctl will now fail. I believe that's a fix
> rather than a regression, it should have been like this since the
> beginning. No known userspace has been using VFIO_IOMMU_TYPE1_NESTING so
> far, so I don't think it should be a concern.
> 
> And if userspace queries the nesting properties using the new ABI
> introduced in this patchset, it will obtain an empty struct.

yes.

> I think
> that's acceptable, but it may be better to avoid adding the nesting cap if
> @format is 0?

right. will add it in patch 4/15.

Regards,
Yi Liu

> 
> Thanks,
> Jean
> 
> >
> > @Eric, how about your opinion? your dual-stage vSMMU support may
> > also share the vfio_iommu_type1.c code.
> >
> > Regards,
> > Yi Liu
> >
> > > Will


RE: [PATCH v5 15/15] iommu/vt-d: Support reporting nesting capability info

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Saturday, July 18, 2020 1:14 AM
> 
> Hi Yi,
> 
> Missing a proper commit message. You can comment on the fact you only
> support the case where all the physical iomms have the same CAP/ECAP MASKS

got it. will add it. it looks like the subject is straightforward, so I
removed the commit message.

> 
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> > v2 -> v3:
> > *) remove cap/ecap_mask in iommu_nesting_info.
> > ---
> >  drivers/iommu/intel/iommu.c | 81
> +++--
> >  include/linux/intel-iommu.h | 16 +
> >  2 files changed, 95 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index a9504cb..9f7ad1a 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -5659,12 +5659,16 @@ static inline bool iommu_pasid_support(void)
> >  static inline bool nested_mode_support(void)
> >  {
> > struct dmar_drhd_unit *drhd;
> > -   struct intel_iommu *iommu;
> > +   struct intel_iommu *iommu, *prev = NULL;
> > bool ret = true;
> >
> > rcu_read_lock();
> > for_each_active_iommu(iommu, drhd) {
> > -   if (!sm_supported(iommu) || !ecap_nest(iommu->ecap)) {
> > +   if (!prev)
> > +   prev = iommu;
> > +   if (!sm_supported(iommu) || !ecap_nest(iommu->ecap) ||
> > +   (VTD_CAP_MASK & (iommu->cap ^ prev->cap)) ||
> > +   (VTD_ECAP_MASK & (iommu->ecap ^ prev->ecap))) {
> > ret = false;
> > break;
> > }
> > @@ -6079,6 +6083,78 @@ intel_iommu_domain_set_attr(struct iommu_domain
> *domain,
> > return ret;
> >  }
> >
> > +static int intel_iommu_get_nesting_info(struct iommu_domain *domain,
> > +   struct iommu_nesting_info *info)
> > +{
> > +   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > +   u64 cap = VTD_CAP_MASK, ecap = VTD_ECAP_MASK;
> > +   struct device_domain_info *domain_info;
> > +   struct iommu_nesting_info_vtd vtd;
> > +   unsigned long flags;
> > +   unsigned int size;
> > +
> > +   if (domain->type != IOMMU_DOMAIN_UNMANAGED ||
> > +   !(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE))
> > +   return -ENODEV;
> > +
> > +   if (!info)
> > +   return -EINVAL;
> > +
> > +   size = sizeof(struct iommu_nesting_info) +
> > +   sizeof(struct iommu_nesting_info_vtd);
> > +   /*
> > +* if provided buffer size is smaller than expected, should
> > +* return 0 and also the expected buffer size to caller.
> > +*/
> > +   if (info->size < size) {
> > +   info->size = size;
> > +   return 0;
> > +   }
> > +
> > +   spin_lock_irqsave(&device_domain_lock, flags);
> > +   /*
> > +* arbitrary select the first domain_info as all nesting
> > +* related capabilities should be consistent across iommu
> > +* units.
> > +*/
> > +   domain_info = list_first_entry(&dmar_domain->devices,
> > +  struct device_domain_info, link);
> > +   cap &= domain_info->iommu->cap;
> > +   ecap &= domain_info->iommu->ecap;
> > +   spin_unlock_irqrestore(&device_domain_lock, flags);
> > +
> > +   info->format = IOMMU_PASID_FORMAT_INTEL_VTD;
> > +   info->features = IOMMU_NESTING_FEAT_SYSWIDE_PASID |
> > +IOMMU_NESTING_FEAT_BIND_PGTBL |
> > +IOMMU_NESTING_FEAT_CACHE_INVLD;
> > +   info->addr_width = dmar_domain->gaw;
> > +   info->pasid_bits = ilog2(intel_pasid_max_id);
> > +   info->padding = 0;
> > +   vtd.flags = 0;
> > +   vtd.padding = 0;
> > +   vtd.cap_reg = cap;
> > +   vtd.ecap_reg = ecap;
> > +
> > +   memcpy(info->data, &vtd, sizeof(vtd));
> > +   return 0;
> > +}
> > +
> > +static int intel_iommu_domain_get_attr(struct iommu_domain *domain,
> > +  enum iommu_attr attr, void *data)
> > +{
> > +   switch (attr) {
> > +   case DOMAIN_ATTR_NESTING:
> > +   {
> > +   struct iommu_nesti

RE: [PATCH v5 13/15] vfio/pci: Expose PCIe PASID capability to guest

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Monday, July 20, 2020 8:35 PM
> 
> Yi,
> 
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > This patch exposes PCIe PASID capability to guest for assigned devices.
> > Existing vfio_pci driver hides it from guest by setting the capability
> > length as 0 in pci_ext_cap_length[].
> >
> > And this patch only exposes PASID capability for devices which has PCIe
> > PASID extended structure in its configuration space. So VFs, will will
> s/will//

got it.

> > not see PASID capability on VFs as VF doesn't implement PASID extended
> suggested rewording: VFs will not expose the PASID capability as they do
> not implement the PASID extended structure in their config space?

make sense. will modify it.

> > structure in its configuration space. For VF, it is a TODO in future.
> > Related discussion can be found in below link:
> >
> > https://lkml.org/lkml/2020/4/7/693
> 
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > ---
> > v1 -> v2:
> > *) added in v2, but it was sent in a separate patchseries before
> > ---
> >  drivers/vfio/pci/vfio_pci_config.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_config.c 
> > b/drivers/vfio/pci/vfio_pci_config.c
> > index d98843f..07ff2e6 100644
> > --- a/drivers/vfio/pci/vfio_pci_config.c
> > +++ b/drivers/vfio/pci/vfio_pci_config.c
> > @@ -95,7 +95,7 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX +
> 1] = {
> > [PCI_EXT_CAP_ID_LTR]=   PCI_EXT_CAP_LTR_SIZEOF,
> > [PCI_EXT_CAP_ID_SECPCI] =   0,  /* not yet */
> > [PCI_EXT_CAP_ID_PMUX]   =   0,  /* not yet */
> > -   [PCI_EXT_CAP_ID_PASID]  =   0,  /* not yet */
> > +   [PCI_EXT_CAP_ID_PASID]  =   PCI_EXT_CAP_PASID_SIZEOF,
> >  };
> >
> >  /*
> >
> Reviewed-by: Eric Auger 

thanks.

Regards,
Yi Liu

> Thanks
> 
> Eric



RE: [PATCH v5 09/15] iommu/vt-d: Check ownership for PASIDs from user-space

2020-07-20 Thread Liu, Yi L
Eric,

> From: Auger Eric 
> Sent: Monday, July 20, 2020 8:38 PM
> 
> Yi,
> 
> On 7/20/20 12:18 PM, Liu, Yi L wrote:
> > Hi Eric,
> >
> >> From: Auger Eric 
> >> Sent: Monday, July 20, 2020 12:06 AM
> >>
> >> Hi Yi,
> >>
> >> On 7/12/20 1:21 PM, Liu Yi L wrote:
> >>> When an IOMMU domain with nesting attribute is used for guest SVA, a
> >>> system-wide PASID is allocated for binding with the device and the domain.
> >>> For security reason, we need to check the PASID passsed from user-space.
> >> passed
> >
> > got it.
> >
> >>> e.g. page table bind/unbind and PASID related cache invalidation.
> >>>
> >>> Cc: Kevin Tian 
> >>> CC: Jacob Pan 
> >>> Cc: Alex Williamson 
> >>> Cc: Eric Auger 
> >>> Cc: Jean-Philippe Brucker 
> >>> Cc: Joerg Roedel 
> >>> Cc: Lu Baolu 
> >>> Signed-off-by: Liu Yi L 
> >>> Signed-off-by: Jacob Pan 
> >>> ---
> >>>  drivers/iommu/intel/iommu.c | 10 ++
> >>>  drivers/iommu/intel/svm.c   |  7 +--
> >>>  2 files changed, 15 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> >>> index 4d54198..a9504cb 100644
> >>> --- a/drivers/iommu/intel/iommu.c
> >>> +++ b/drivers/iommu/intel/iommu.c
> >>> @@ -5436,6 +5436,7 @@ intel_iommu_sva_invalidate(struct iommu_domain
> >> *domain, struct device *dev,
> >>>   int granu = 0;
> >>>   u64 pasid = 0;
> >>>   u64 addr = 0;
> >>> + void *pdata;
> >>>
> >>>   granu = to_vtd_granularity(cache_type, inv_info->granularity);
> >>>   if (granu == -EINVAL) {
> >>> @@ -5456,6 +5457,15 @@ intel_iommu_sva_invalidate(struct iommu_domain
> >> *domain, struct device *dev,
> >>>(inv_info->granu.addr_info.flags &
> >> IOMMU_INV_ADDR_FLAGS_PASID))
> >>>   pasid = inv_info->granu.addr_info.pasid;
> >>>
> >>> + pdata = ioasid_find(dmar_domain->ioasid_sid, pasid, NULL);
> >>> + if (!pdata) {
> >>> + ret = -EINVAL;
> >>> + goto out_unlock;
> >>> + } else if (IS_ERR(pdata)) {
> >>> + ret = PTR_ERR(pdata);
> >>> + goto out_unlock;
> >>> + }
> >>> +
> >>>   switch (BIT(cache_type)) {
> >>>   case IOMMU_CACHE_INV_TYPE_IOTLB:
> >>>   /* HW will ignore LSB bits based on address mask */
> >>> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> >>> index d2c0e1a..212dee0 100644
> >>> --- a/drivers/iommu/intel/svm.c
> >>> +++ b/drivers/iommu/intel/svm.c
> >>> @@ -319,7 +319,7 @@ int intel_svm_bind_gpasid(struct iommu_domain
> *domain,
> >> struct device *dev,
> >>>   dmar_domain = to_dmar_domain(domain);
> >>>
> >>>   mutex_lock(&pasid_mutex);
> >>> - svm = ioasid_find(INVALID_IOASID_SET, data->hpasid, NULL);
> I meant while using INVALID_IOASID_SET instead of the actual
> dmar_domain->ioasid_sid. But I think I've now recovered, the asset is
> simply not used ;-)

oh, I think it should have been using dmar_domain->ioasid_sid from the
beginning. does it look good so far? :-)

Regards,
Yi Liu

> >> I do not get what the call was supposed to do before that patch?
> >
> > you mean patch 10/15 by "that patch", right? the ownership check should
> > be done as to prevent illegal bind request from userspace. before patch
> > 10/15, it should be added.
> >
> >>> + svm = ioasid_find(dmar_domain->ioasid_sid, data->hpasid, NULL);
> >>>   if (IS_ERR(svm)) {
> >>>   ret = PTR_ERR(svm);
> >>>   goto out;
> >>> @@ -436,6 +436,7 @@ int intel_svm_unbind_gpasid(struct iommu_domain
> >> *domain,
> >>>   struct device *dev, ioasid_t pasid)
> >>>  {
> >>>   struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> >>> + struct dmar_domain *dmar_domain;
> >>>   struct intel_svm_dev *sdev;
> >>>   struct intel_svm *svm;
> >>>   int ret = -EINVAL;
> >>> @@ -443,8 +444,10 @@ int intel_svm_unbind_gpasid(struct iommu_domain
> >> *domain,
> >>>   if (WARN_ON(!iommu))
> >>>   return -EINVAL;
> >>>
> >>> + dmar_domain = to_dmar_domain(domain);
> >>> +
> >>>   mutex_lock(_mutex);
> >>> - svm = ioasid_find(INVALID_IOASID_SET, pasid, NULL);
> >>> + svm = ioasid_find(dmar_domain->ioasid_sid, pasid, NULL);
> >> just to make sure, about the locking, can't domain->ioasid_sid change
> >> under the hood?
> >
> > I guess not. intel_svm_unbind_gpasid() and iommu_domain_set_attr()
> > is called by vfio today, and within vfio, there is vfio_iommu->lock.
> OK
> 
> Thanks
> 
> Eric
> >
> > Regards,
> > Yi Liu
> >
> >>
> >> Thanks
> >>
> >> Eric
> >>>   if (!svm) {
> >>>   ret = -EINVAL;
> >>>   goto out;
> >>>
> >



RE: [PATCH v5 11/15] vfio/type1: Allow invalidating first-level/stage IOMMU cache

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Monday, July 20, 2020 5:42 PM
> 
> Yi,
> 
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > This patch provides an interface allowing the userspace to invalidate
> > IOMMU cache for first-level page table. It is required when the first
> > level IOMMU page table is not managed by the host kernel in the nested
> > translation setup.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Eric Auger 
> > Signed-off-by: Jacob Pan 
> > ---
> > v1 -> v2:
> > *) rename from "vfio/type1: Flush stage-1 IOMMU cache for nesting type"
> > *) rename vfio_cache_inv_fn() to vfio_dev_cache_invalidate_fn()
> > *) vfio_dev_cache_inv_fn() always successful
> > *) remove VFIO_IOMMU_CACHE_INVALIDATE, and reuse
> VFIO_IOMMU_NESTING_OP
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 50
> +
> >  include/uapi/linux/vfio.h   |  3 +++
> >  2 files changed, 53 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > b/drivers/vfio/vfio_iommu_type1.c
> > index f0f21ff..960cc59 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -3073,6 +3073,53 @@ static long vfio_iommu_handle_pgtbl_op(struct
> vfio_iommu *iommu,
> > return ret;
> >  }
> >
> > +static int vfio_dev_cache_invalidate_fn(struct device *dev, void *data)
> > +{
> > +   struct domain_capsule *dc = (struct domain_capsule *)data;
> > +   unsigned long arg = *(unsigned long *)dc->data;
> > +
> > +   iommu_cache_invalidate(dc->domain, dev, (void __user *)arg);
> > +   return 0;
> > +}
> > +
> > +static long vfio_iommu_invalidate_cache(struct vfio_iommu *iommu,
> > +   unsigned long arg)
> > +{
> > +   struct domain_capsule dc = { .data = &arg };
> > +   struct vfio_group *group;
> > +   struct vfio_domain *domain;
> > +   int ret = 0;
> > +   struct iommu_nesting_info *info;
> > +
> > +   mutex_lock(&iommu->lock);
> > +   /*
> > +* Cache invalidation is required for any nesting IOMMU,
> > +* so no need to check system-wide PASID support.
> > +*/
> > +   info = iommu->nesting_info;
> > +   if (!info || !(info->features & IOMMU_NESTING_FEAT_CACHE_INVLD)) {
> > +   ret = -EOPNOTSUPP;
> > +   goto out_unlock;
> > +   }
> > +
> > +   group = vfio_find_nesting_group(iommu);
> so I see you reuse it here. But still wondering if you can't directly
> set dc.domain and dc.group below using list_first_entry?

I guess yes for the current implementation. I also considered adding a
helper function that returns a dc with the group and domain fields
initialized, since this is common code shared by the bind/unbind and
cache_inv paths. Perhaps something like get_domain_capsule_for_nesting().
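A rough userspace model of that helper idea (stand-in types and list handling, not the real kernel structures or list API) might look like:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the proposed get_domain_capsule_for_nesting(): one place
 * that fills the group, domain and data fields of a domain_capsule so
 * that the bind/unbind and cache-invalidate paths can share the code.
 * All types here are simplified stand-ins for illustration only.
 */
struct iommu_domain { int id; };
struct vfio_group { int id; };
struct vfio_domain { struct iommu_domain *domain; };

struct domain_capsule {
	struct vfio_group *group;
	struct iommu_domain *domain;
	void *data;
};

struct vfio_iommu {
	struct vfio_domain *first_domain;  /* list_first_entry() stand-in */
	struct vfio_group *nesting_group;  /* vfio_find_nesting_group() stand-in */
};

static int get_domain_capsule_for_nesting(struct vfio_iommu *iommu,
					  struct domain_capsule *dc,
					  unsigned long *arg)
{
	if (!iommu->nesting_group || !iommu->first_domain)
		return -1;		/* -EINVAL in the kernel */
	dc->group = iommu->nesting_group;
	dc->domain = iommu->first_domain->domain;
	dc->data = arg;
	return 0;
}
```

With this, both vfio_iommu_invalidate_cache() and the page-table bind/unbind path would reduce to one helper call plus the per-device loop.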

> > +   if (!group) {
> > +   ret = -EINVAL;
> > +   goto out_unlock;
> > +   }
> > +
> > +   domain = list_first_entry(&iommu->domain_list,
> > + struct vfio_domain, next);
> > +   dc.group = group;
> > +   dc.domain = domain->domain;
> > +   iommu_group_for_each_dev(group->iommu_group, &dc,
> > +vfio_dev_cache_invalidate_fn);
> > +
> > +out_unlock:
> > +   mutex_unlock(&iommu->lock);
> > +   return ret;
> > +}
> > +
> >  static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu,
> > unsigned long arg)
> >  {
> > @@ -3095,6 +3142,9 @@ static long vfio_iommu_type1_nesting_op(struct
> vfio_iommu *iommu,
> > case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL:
> > ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz);
> > break;
> > +   case VFIO_IOMMU_NESTING_OP_CACHE_INVLD:
> > +   ret = vfio_iommu_invalidate_cache(iommu, arg + minsz);
> > +   break;
> > default:
> > ret = -EINVAL;
> > }
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index a8ad786..845a5800 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -1225,6 +1225,8 @@ struct vfio_iommu_type1_pasid_request {
> >   * +-+---+
> >   * | UNBIND_PGTBL|  struct iommu_g

RE: [PATCH v5 10/15] vfio/type1: Support binding guest page tables to PASID

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Monday, July 20, 2020 5:37 PM
> 
> Yi,
> 
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > Nesting translation allows two-levels/stages page tables, with 1st level
> > for guest translations (e.g. GVA->GPA), 2nd level for host translations
> > (e.g. GPA->HPA). This patch adds interface for binding guest page tables
> > to a PASID. This PASID must have been allocated to user space before the
> by the userspace?

yes, it is. will modify it.

> > binding request.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Jean-Philippe Brucker 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> > v3 -> v4:
> > *) address comments from Alex on v3
> >
> > v2 -> v3:
> > *) use __iommu_sva_unbind_gpasid() for unbind call issued by VFIO
> >https://lore.kernel.org/linux-iommu/1592931837-58223-6-git-send-email-
> jacob.jun@linux.intel.com/
> >
> > v1 -> v2:
> > *) rename subject from "vfio/type1: Bind guest page tables to host"
> > *) remove VFIO_IOMMU_BIND, introduce VFIO_IOMMU_NESTING_OP to support
> bind/
> >unbind guet page table
> > *) replaced vfio_iommu_for_each_dev() with a group level loop since this
> >series enforces one group per container w/ nesting type as start.
> > *) rename vfio_bind/unbind_gpasid_fn() to vfio_dev_bind/unbind_gpasid_fn()
> > *) vfio_dev_unbind_gpasid() always successful
> > *) use vfio_mm->pasid_lock to avoid race between PASID free and page table
> >bind/unbind
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 166
> 
> >  drivers/vfio/vfio_pasid.c   |  26 +++
> >  include/linux/vfio.h|  20 +
> >  include/uapi/linux/vfio.h   |  31 
> >  4 files changed, 243 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > b/drivers/vfio/vfio_iommu_type1.c
> > index 55b4065..f0f21ff 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -149,6 +149,30 @@ struct vfio_regions {
> >  #define DIRTY_BITMAP_PAGES_MAX  ((u64)INT_MAX)
> >  #define DIRTY_BITMAP_SIZE_MAX
> DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
> >
> > +struct domain_capsule {
> > +   struct vfio_group *group;
> > +   struct iommu_domain *domain;
> > +   void *data;
> > +};
> > +
> > +/* iommu->lock must be held */
> > +static struct vfio_group *vfio_find_nesting_group(struct vfio_iommu *iommu)
> > +{
> > +   struct vfio_domain *d;
> > +   struct vfio_group *group = NULL;
> > +
> > +   if (!iommu->nesting_info)
> > +   return NULL;
> > +
> > +   /* only support singleton container with nesting type */
> > +   list_for_each_entry(d, &iommu->domain_list, next) {
> > +   list_for_each_entry(group, &d->group_list, next) {
> > +   break;
> use list_first_entry?

yeah, based on the discussion in the link below, we only support a singleton
container with nesting type, and if there is no group in the container, the
nesting_info will be cleared. So yes, list_first_entry() will work as well.

https://lkml.org/lkml/2020/5/15/1028
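Given that invariant (at most one domain and one group per nesting container), a stand-in sketch of how the nested loops collapse to taking the first entries; the types here are simplified models, not the kernel list API:

```c
#include <assert.h>
#include <stddef.h>

/*
 * With a singleton nesting container there is at most one vfio_domain
 * and one vfio_group, so the nested list_for_each_entry() loops in
 * vfio_find_nesting_group() reduce to taking the first entry of each
 * list. Stand-in structures for illustration only.
 */
struct vfio_group { int id; };
struct vfio_domain { struct vfio_group *first_group; };
struct vfio_iommu {
	int has_nesting_info;
	struct vfio_domain *first_domain;
};

static struct vfio_group *vfio_find_nesting_group(struct vfio_iommu *iommu)
{
	if (!iommu->has_nesting_info || !iommu->first_domain)
		return NULL;
	/* list_first_entry() equivalent for the singleton container */
	return iommu->first_domain->first_group;
}
```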

> > +   }
> > +   }
> > +   return group;
> > +}
> > +
> >  static int put_pfn(unsigned long pfn, int prot);
> >
> >  static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu
> *iommu,
> > @@ -2349,6 +2373,48 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu
> *iommu,
> > return ret;
> >  }
> >
> > +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +   struct domain_capsule *dc = (struct domain_capsule *)data;
> > +   unsigned long arg = *(unsigned long *)dc->data;
> > +
> > +   return iommu_sva_bind_gpasid(dc->domain, dev, (void __user *)arg);
> > +}
> > +
> > +static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +   struct domain_capsule *dc = (struct domain_capsule *)data;
> > +   unsigned long arg = *(unsigned long *)dc->data;
> > +
> > +   iommu_sva_unbind_gpasid(dc->domain, dev, (void __user *)arg);
> > +   return 0;
> > +}
> > +
> > +static int __vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +   struct domain_capsule *dc = (struct domain_capsule *)data;
> > +   struct iommu_gpasid_bin

RE: [PATCH v5 09/15] iommu/vt-d: Check ownership for PASIDs from user-space

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Monday, July 20, 2020 12:06 AM
> 
> Hi Yi,
> 
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > When an IOMMU domain with nesting attribute is used for guest SVA, a
> > system-wide PASID is allocated for binding with the device and the domain.
> > For security reason, we need to check the PASID passsed from user-space.
> passed

got it.

> > e.g. page table bind/unbind and PASID related cache invalidation.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> >  drivers/iommu/intel/iommu.c | 10 ++
> >  drivers/iommu/intel/svm.c   |  7 +--
> >  2 files changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 4d54198..a9504cb 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -5436,6 +5436,7 @@ intel_iommu_sva_invalidate(struct iommu_domain
> *domain, struct device *dev,
> > int granu = 0;
> > u64 pasid = 0;
> > u64 addr = 0;
> > +   void *pdata;
> >
> > granu = to_vtd_granularity(cache_type, inv_info->granularity);
> > if (granu == -EINVAL) {
> > @@ -5456,6 +5457,15 @@ intel_iommu_sva_invalidate(struct iommu_domain
> *domain, struct device *dev,
> >  (inv_info->granu.addr_info.flags &
> IOMMU_INV_ADDR_FLAGS_PASID))
> > pasid = inv_info->granu.addr_info.pasid;
> >
> > +   pdata = ioasid_find(dmar_domain->ioasid_sid, pasid, NULL);
> > +   if (!pdata) {
> > +   ret = -EINVAL;
> > +   goto out_unlock;
> > +   } else if (IS_ERR(pdata)) {
> > +   ret = PTR_ERR(pdata);
> > +   goto out_unlock;
> > +   }
> > +
> > switch (BIT(cache_type)) {
> > case IOMMU_CACHE_INV_TYPE_IOTLB:
> > /* HW will ignore LSB bits based on address mask */
> > diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> > index d2c0e1a..212dee0 100644
> > --- a/drivers/iommu/intel/svm.c
> > +++ b/drivers/iommu/intel/svm.c
> > @@ -319,7 +319,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain,
> struct device *dev,
> > dmar_domain = to_dmar_domain(domain);
> >
> > mutex_lock(&pasid_mutex);
> > -   svm = ioasid_find(INVALID_IOASID_SET, data->hpasid, NULL);
> I do not get what the call was supposed to do before that patch?

You mean patch 10/15 by "that patch", right? The ownership check is needed
to prevent illegal bind requests from userspace, so it should be in place
even before patch 10/15.
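A toy model (not the kernel ioasid API) of why the set-scoped lookup acts as an ownership check:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the ownership check: looking a PASID up in a specific
 * ioasid set fails when the PASID was allocated from a different set,
 * which is what rejects a bind/unbind request against a PASID the
 * caller does not own. The table and lookup are illustrative only.
 */
struct ioasid_entry { int set; unsigned int pasid; void *data; };

static void *ioasid_find_in_set(const struct ioasid_entry *tbl, size_t n,
				int set, unsigned int pasid)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].set == set && tbl[i].pasid == pasid)
			return tbl[i].data;
	return NULL;	/* PASID not owned by this set */
}
```

This mirrors the change from ioasid_find(INVALID_IOASID_SET, ...) to ioasid_find(dmar_domain->ioasid_sid, ...): the former matched any PASID, the latter only PASIDs belonging to the domain's set.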

> > +   svm = ioasid_find(dmar_domain->ioasid_sid, data->hpasid, NULL);
> > if (IS_ERR(svm)) {
> > ret = PTR_ERR(svm);
> > goto out;
> > @@ -436,6 +436,7 @@ int intel_svm_unbind_gpasid(struct iommu_domain
> *domain,
> > struct device *dev, ioasid_t pasid)
> >  {
> > struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > +   struct dmar_domain *dmar_domain;
> > struct intel_svm_dev *sdev;
> > struct intel_svm *svm;
> > int ret = -EINVAL;
> > @@ -443,8 +444,10 @@ int intel_svm_unbind_gpasid(struct iommu_domain
> *domain,
> > if (WARN_ON(!iommu))
> > return -EINVAL;
> >
> > +   dmar_domain = to_dmar_domain(domain);
> > +
> > mutex_lock(&pasid_mutex);
> > -   svm = ioasid_find(INVALID_IOASID_SET, pasid, NULL);
> > +   svm = ioasid_find(dmar_domain->ioasid_sid, pasid, NULL);
> just to make sure, about the locking, can't domain->ioasid_sid change
> under the hood?

I guess not. intel_svm_unbind_gpasid() and iommu_domain_set_attr()
are called by vfio today, and within vfio they are serialized by
vfio_iommu->lock.

Regards,
Yi Liu

> 
> Thanks
> 
> Eric
> > if (!svm) {
> > ret = -EINVAL;
> > goto out;
> >



RE: [PATCH v5 08/15] iommu: Pass domain to sva_unbind_gpasid()

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Sunday, July 19, 2020 11:38 PM
> 
> Yi,
> 
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > From: Yi Sun 
> >
> > Current interface is good enough for SVA virtualization on an assigned
> > physical PCI device, but when it comes to mediated devices, a physical
> > device may attached with multiple aux-domains. Also, for guest unbind,
> > the PASID to be unbind should be allocated to the VM. This check
> > requires to know the ioasid_set which is associated with the domain.
> >
> > So this interface needs to pass in domain info. Then the iommu driver
> > is able to know which domain will be used for the 2nd stage
> > translation of the nesting mode and also be able to do PASID ownership
> > check. This patch passes @domain per the above reason.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Yi Sun 
> > Signed-off-by: Liu Yi L 
> > ---
> > v2 -> v3:
> > *) pass in domain info only
> > *) use ioasid_t for pasid instead of int type
> >
> > v1 -> v2:
> > *) added in v2.
> > ---
> >  drivers/iommu/intel/svm.c   | 3 ++-
> >  drivers/iommu/iommu.c   | 2 +-
> >  include/linux/intel-iommu.h | 3 ++-
> >  include/linux/iommu.h   | 3 ++-
> >  4 files changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> > index b9a9c55..d2c0e1a 100644
> > --- a/drivers/iommu/intel/svm.c
> > +++ b/drivers/iommu/intel/svm.c
> > @@ -432,7 +432,8 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain,
> struct device *dev,
> > return ret;
> >  }
> >
> > -int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> > +int intel_svm_unbind_gpasid(struct iommu_domain *domain,
> > +   struct device *dev, ioasid_t pasid)
> int -> ioasid_t proto change is not described in the commit message,

oops, yes. BTW, I noticed there is another thread that is going to use
u32 for pasid, so perhaps I need to drop this change.

https://lore.kernel.org/linux-iommu/1594684087-61184-2-git-send-email-fenghua...@intel.com/#Z30drivers:iommu:iommu.c
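For reference, and assuming the include/linux/ioasid.h definition of the time, ioasid_t is already an unsigned 32-bit type, so the two series agree on width either way:

```c
#include <assert.h>

/*
 * ioasid_t in the kernel is "typedef unsigned int ioasid_t;" with
 * INVALID_IOASID defined as ((ioasid_t)-1) -- the same 32-bit width
 * as the u32 proposal. Moving the prototypes off plain int avoids
 * sign confusion around the invalid value. Reproduced here only to
 * illustrate the type relationship.
 */
typedef unsigned int ioasid_t;
#define INVALID_IOASID ((ioasid_t)-1)
```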

> >  {
> > struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > struct intel_svm_dev *sdev;
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index
> > 7910249..d3e554c 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -2151,7 +2151,7 @@ int __iommu_sva_unbind_gpasid(struct iommu_domain
> *domain, struct device *dev,
> > if (unlikely(!domain->ops->sva_unbind_gpasid))
> > return -ENODEV;
> >
> > -   return domain->ops->sva_unbind_gpasid(dev, data->hpasid);
> > +   return domain->ops->sva_unbind_gpasid(domain, dev, data->hpasid);
> >  }
> >  EXPORT_SYMBOL_GPL(__iommu_sva_unbind_gpasid);
> >
> > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> > index 0d0ab32..18f292e 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -738,7 +738,8 @@ extern int intel_svm_enable_prq(struct intel_iommu
> > *iommu);  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
> > int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
> >   struct iommu_gpasid_bind_data *data); -int
> > intel_svm_unbind_gpasid(struct device *dev, int pasid);
> > +int intel_svm_unbind_gpasid(struct iommu_domain *domain,
> > +   struct device *dev, ioasid_t pasid);
> >  struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_struct *mm,
> >  void *drvdata);
> >  void intel_svm_unbind(struct iommu_sva *handle); diff --git
> > a/include/linux/iommu.h b/include/linux/iommu.h index e84a1d5..ca5edd8
> > 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -303,7 +303,8 @@ struct iommu_ops {
> > int (*sva_bind_gpasid)(struct iommu_domain *domain,
> > struct device *dev, struct iommu_gpasid_bind_data 
> > *data);
> >
> > -   int (*sva_unbind_gpasid)(struct device *dev, int pasid);
> > +   int (*sva_unbind_gpasid)(struct iommu_domain *domain,
> > +struct device *dev, ioasid_t pasid);
> >
> > int (*def_domain_type)(struct device *dev);
> >
> >
> Besides
> Reviewed-by: Eric Auger 
>

thanks.

Regards,
Yi Liu

> Eric



RE: [PATCH v5 07/15] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Sunday, July 19, 2020 11:39 PM
> 
> Yi,
> 
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > This patch allows user space to request PASID allocation/free, e.g.
> > when serving the request from the guest.
> >
> > PASIDs that are not freed by userspace are automatically freed when
> > the IOASID set is destroyed when process exits.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Yi Sun 
> > Signed-off-by: Jacob Pan 
> > ---
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) the comments for the PASID_FREE request is addressed in patch 5/15 of
> >this series.
> >
> > v3 -> v4:
> > *) address comments from v3, except the below comment against the range
> >of PASID_FREE request. needs more help on it.
> > "> +if (req.range.min > req.range.max)
> >
> > Is it exploitable that a user can spin the kernel for a long time in
> > the case of a free by calling this with [0, MAX_UINT] regardless of
> > their actual allocations?"
> >
> > https://lore.kernel.org/linux-iommu/20200702151832.048b4...@x1.home/
> >
> > v1 -> v2:
> > *) move the vfio_mm related code to be a seprate module
> > *) use a single structure for alloc/free, could support a range of
> > PASIDs
> > *) fetch vfio_mm at group_attach time instead of at iommu driver open
> > time
> > ---
> >  drivers/vfio/Kconfig|  1 +
> >  drivers/vfio/vfio_iommu_type1.c | 85
> +
> >  drivers/vfio/vfio_pasid.c   | 10 +
> >  include/linux/vfio.h|  6 +++
> >  include/uapi/linux/vfio.h   | 37 ++
> >  5 files changed, 139 insertions(+)
> >
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index
> > 3d8a108..95d90c6 100644
> > --- a/drivers/vfio/Kconfig
> > +++ b/drivers/vfio/Kconfig
> > @@ -2,6 +2,7 @@
> >  config VFIO_IOMMU_TYPE1
> > tristate
> > depends on VFIO
> > +   select VFIO_PASID if (X86)
> > default n
> >
> >  config VFIO_IOMMU_SPAPR_TCE
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index ed80104..55b4065 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -76,6 +76,7 @@ struct vfio_iommu {
> > booldirty_page_tracking;
> > boolpinned_page_dirty_scope;
> > struct iommu_nesting_info   *nesting_info;
> > +   struct vfio_mm  *vmm;
> >  };
> >
> >  struct vfio_domain {
> > @@ -1937,6 +1938,11 @@ static void vfio_iommu_iova_insert_copy(struct
> > vfio_iommu *iommu,
> >
> >  static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
> > {
> > +   if (iommu->vmm) {
> > +   vfio_mm_put(iommu->vmm);
> > +   iommu->vmm = NULL;
> > +   }
> > +
> > kfree(iommu->nesting_info);
> > iommu->nesting_info = NULL;
> >  }
> > @@ -2071,6 +2077,26 @@ static int vfio_iommu_type1_attach_group(void
> *iommu_data,
> > iommu->nesting_info);
> > if (ret)
> > goto out_detach;
> > +
> > +   if (iommu->nesting_info->features &
> > +   IOMMU_NESTING_FEAT_SYSWIDE_PASID)
> {
> > +   struct vfio_mm *vmm;
> > +   int sid;
> > +
> > +   vmm = vfio_mm_get_from_task(current);
> > +   if (IS_ERR(vmm)) {
> > +   ret = PTR_ERR(vmm);
> > +   goto out_detach;
> > +   }
> > +   iommu->vmm = vmm;
> > +
> > +   sid = vfio_mm_ioasid_sid(vmm);
> > +   ret = iommu_domain_set_attr(domain->domain,
> > +   DOMAIN_ATTR_IOASID_SID,
> > +   &sid);
> > +   if (ret)
> > +   goto out_detach;
> > +   }
> > }
> >
> > /* Get aperture info */
> > @@ -2855,6 +2881,63 @@ static int vfio_iommu_type1_

RE: [PATCH v5 04/15] vfio/type1: Report iommu nesting info to userspace

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Monday, July 20, 2020 4:33 PM
> 
> Yi,
> 
> On 7/20/20 9:51 AM, Liu, Yi L wrote:
> > Hi Eric,
> >
> >> From: Auger Eric 
> >> Sent: Saturday, July 18, 2020 1:34 AM
> >>
> >> Yi,
> >>
> >> On 7/12/20 1:20 PM, Liu Yi L wrote:
> >>> This patch exports iommu nesting capability info to user space
> >>> through VFIO. User space is expected to check this info for supported 
> >>> uAPIs (e.g.
> >> it is not only to check the supported uAPIS but rather to know which
> >> callbacks it must call upon vIOMMU events and which features are
> >> supported by the physical IOMMU.
> >
> > yes, will refine the description per your comment.
> >
> >>> PASID alloc/free, bind page table, and cache invalidation) and the
> >>> vendor specific format information for first level/stage page table
> >>> that will be bound to.
> >>>
> >>> The nesting info is available only after the nesting iommu type is
> >>> set for a container.
> >> to NESTED type
> >
> > you mean "The nesting info is available only after container set to be 
> > NESTED
> type."
> >
> > right?
> correct

got you.

> >
> >>  Current implementation imposes one limitation - one
> >>> nesting container should include at most one group. The philosophy
> >>> of vfio container is having all groups/devices within the container
> >>> share the same IOMMU context. When vSVA is enabled, one IOMMU
> >>> context could include one 2nd-level address space and multiple 1st-level 
> >>> address
> spaces.
> >>> While the 2nd-level address space is reasonably sharable by multiple
> >>> groups , blindly sharing 1st-level address spaces across all groups
> >>> within the container might instead break the guest expectation. In
> >>> the future sub/ super container concept might be introduced to allow
> >>> partial address space sharing within an IOMMU context. But for now
> >>> let's go with this restriction by requiring singleton container for
> >>> using nesting iommu features. Below link has the related discussion about 
> >>> this
> decision.
> >>
> >> Maybe add a note about SMMU related changes spotted by Jean.
> >
> > I guess you mean the comments Jean gave in patch 3/15, right? I'll
> > copy his comments and also give the below link as well.
> >
> > https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/
> correct

I see.

Regards,
Yi Liu

> Thanks
> 
> Eric
> >
> >>>
> >>> https://lkml.org/lkml/2020/5/15/1028
> >>>
> >>> Cc: Kevin Tian 
> >>> CC: Jacob Pan 
> >>> Cc: Alex Williamson 
> >>> Cc: Eric Auger 
> >>> Cc: Jean-Philippe Brucker 
> >>> Cc: Joerg Roedel 
> >>> Cc: Lu Baolu 
> >>> Signed-off-by: Liu Yi L 
> >>> ---
> >>> v4 -> v5:
> >>> *) address comments from Eric Auger.
> >>> *) return struct iommu_nesting_info for
> >> VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
> >>>cap is much "cheap", if needs extension in future, just define another 
> >>> cap.
> >>>https://lore.kernel.org/kvm/20200708132947.5b7ee...@x1.home/
> >>>
> >>> v3 -> v4:
> >>> *) address comments against v3.
> >>>
> >>> v1 -> v2:
> >>> *) added in v2
> >>> ---
> >>>  drivers/vfio/vfio_iommu_type1.c | 102
> >> +++-
> >>>  include/uapi/linux/vfio.h   |  19 
> >>>  2 files changed, 109 insertions(+), 12 deletions(-)
> >>>
> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c
> >>> b/drivers/vfio/vfio_iommu_type1.c index 3bd70ff..ed80104 100644
> >>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>> @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
> >>>"Maximum number of user DMA mappings per container (65535).");
> >>>
> >>>  struct vfio_iommu {
> >>> - struct list_headdomain_list;
> >>> - struct list_headiova_list;
> >>> - struct vfio_domain  *external_domain; /* domain for external user */
> >>> - struct mutexlock;
> >>> - struct rb_root  dma_list;
> &g

RE: [PATCH v5 05/15] vfio: Add PASID allocation/free support

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Monday, July 20, 2020 4:26 PM
[...]
> >>> +int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max) {
> >>> + ioasid_t pasid;
> >>> + struct vfio_pasid *vid;
> >>> +
> >>> + pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> >>> + if (pasid == INVALID_IOASID)
> >>> + return -ENOSPC;
> >>> +
> >>> + vid = kzalloc(sizeof(*vid), GFP_KERNEL);
> >>> + if (!vid) {
> >>> + ioasid_free(pasid);
> >>> + return -ENOMEM;
> >>> + }
> >>> +
> >>> + vid->pasid = pasid;
> >>> +
> >>> + mutex_lock(&vmm->pasid_lock);
> >>> + vfio_link_pasid(vmm, vid);
> >>> + mutex_unlock(&vmm->pasid_lock);
> >>> +
> >>> + return pasid;
> >>> +}
> >> I am not totally convinced by your previous reply on EXPORT_SYMBOL_GP()
> >> irrelevance in this patch. But well ;-)
> >
As I recall, this was done per a comment from Chris. I guess he meant
it may make sense to export symbols together with the exact driver user
of them in this patch as well :-) but maybe I misunderstood him. If so,
I can add the symbol export back in this patch.
> >
> > https://lore.kernel.org/linux-iommu/20200331075331.ga26...@infradead.org/#t
> OK I don't know the best practice here. Anyway there is no caller at
> this stage either. I think you may describe in the commit message the
> vfio_iommu_type1 will be the eventual user of the exported functions in
> this module, what are the exact exported functions here. You may also
> explain the motivation behind creating a separate module.

got it. will add it.

Regards,
Yi Liu




RE: [PATCH v5 06/15] iommu/vt-d: Support setting ioasid set to domain

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Sunday, July 19, 2020 11:38 PM
> 
> Yi,
> 
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > From IOMMU p.o.v., PASIDs allocated and managed by external components
> > (e.g. VFIO) will be passed in for gpasid_bind/unbind operation. IOMMU
> > needs some knowledge to check the PASID ownership, hence add an
> > interface for those components to tell the PASID owner.
> >
> > In latest kernel design, PASID ownership is managed by IOASID set
> > where the PASID is allocated from. This patch adds support for setting
> > ioasid set ID to the domains used for nesting/vSVA. Subsequent SVA
> > operations on the PASID will be checked against its IOASID set for proper
> ownership.
> Subsequent SVA operations will check the PASID against its IOASID set for 
> proper
> ownership.

got it.

> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > ---
> >  drivers/iommu/intel/iommu.c | 22 ++
> > include/linux/intel-iommu.h |  4 
> >  include/linux/iommu.h   |  1 +
> >  3 files changed, 27 insertions(+)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 72ae6a2..4d54198 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -1793,6 +1793,7 @@ static struct dmar_domain *alloc_domain(int flags)
> > if (first_level_by_default())
> > domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
> > domain->has_iotlb_device = false;
> > +   domain->ioasid_sid = INVALID_IOASID_SET;
> > INIT_LIST_HEAD(&domain->devices);
> >
> > return domain;
> > @@ -6039,6 +6040,27 @@ intel_iommu_domain_set_attr(struct iommu_domain
> *domain,
> > }
> > spin_unlock_irqrestore(&device_domain_lock, flags);
> > break;
> > +   case DOMAIN_ATTR_IOASID_SID:
> > +   {
> > +   int sid = *(int *)data;
> > +
> > +   if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE)) {
> > +   ret = -ENODEV;
> > +   break;
> > +   }
> > +   spin_lock_irqsave(&device_domain_lock, flags);
> I think the lock should be taken before the DOMAIN_FLAG_NESTING_MODE check.
> Otherwise, the flags can theoretically be changed in between the check
> and the test
> below?

I see. will correct it.
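To make the corrected ordering concrete, here is a small self-contained model (stand-in lock and simplified types, not the kernel code) of the DOMAIN_ATTR_IOASID_SID flow with the lock taken before the nesting-flag check:

```c
#include <assert.h>

/*
 * Model of the fix: take the lock before testing the nesting flag so
 * the flags cannot change between the check and the ioasid_sid
 * update. lock()/unlock() stand in for device_domain_lock, and the
 * error values stand in for -ENODEV/-EBUSY.
 */
#define DOMAIN_FLAG_NESTING_MODE 0x1
#define INVALID_IOASID_SET (-1)

struct dmar_domain { int flags; int ioasid_sid; };

static int locked;			/* device_domain_lock stand-in */
static void lock(void)   { locked = 1; }
static void unlock(void) { locked = 0; }

static int set_ioasid_sid(struct dmar_domain *d, int sid)
{
	int ret = 0;

	lock();				/* lock first ... */
	if (!(d->flags & DOMAIN_FLAG_NESTING_MODE))
		ret = -1;		/* ... then check the flag (-ENODEV) */
	else if (d->ioasid_sid != INVALID_IOASID_SET &&
		 d->ioasid_sid != sid)
		ret = -2;		/* already bound to another set (-EBUSY) */
	else
		d->ioasid_sid = sid;
	unlock();
	return ret;
}
```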

Thanks,
Yi Liu

> Thanks
> 
> Eric
> > +   if (dmar_domain->ioasid_sid != INVALID_IOASID_SET &&
> > +   dmar_domain->ioasid_sid != sid) {
> > +   pr_warn_ratelimited("multi ioasid_set (%d:%d) setting",
> > +   dmar_domain->ioasid_sid, sid);
> > +   ret = -EBUSY;
> > +   spin_unlock_irqrestore(&device_domain_lock, flags);
> > +   break;
> > +   }
> > +   dmar_domain->ioasid_sid = sid;
> > +   spin_unlock_irqrestore(&device_domain_lock, flags);
> > +   break;
> > +   }
> > default:
> > ret = -EINVAL;
> > break;
> > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> > index 3f23c26..0d0ab32 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -549,6 +549,10 @@ struct dmar_domain {
> >2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
> > u64 max_addr;   /* maximum mapped address */
> >
> > +   int ioasid_sid; /*
> > +* the ioasid set which tracks all
> > +* PASIDs used by the domain.
> > +*/
> > int default_pasid;  /*
> >  * The default pasid used for non-SVM
> >  * traffic on mediated devices.
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h index
> > 7ca9d48..e84a1d5 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -124,6 +124,7 @@ enum iommu_attr {
> > DOMAIN_ATTR_FSL_PAMUV1,
> > DOMAIN_ATTR_NESTING,/* two stages of translation */
> > DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
> > +   DOMAIN_ATTR_IOASID_SID,
> > DOMAIN_ATTR_MAX,
> >  };
> >
> >



RE: [PATCH v5 05/15] vfio: Add PASID allocation/free support

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Sunday, July 19, 2020 11:39 PM
> 
> Yi,
> 
> On 7/12/20 1:21 PM, Liu Yi L wrote:
> > Shared Virtual Addressing (a.k.a Shared Virtual Memory) allows sharing
> > multiple process virtual address spaces with the device for simplified
> > programming model. PASID is used to tag an virtual address space in
> > DMA requests and to identify the related translation structure in
> > IOMMU. When a PASID-capable device is assigned to a VM, we want the
> > same capability of using PASID to tag guest process virtual address
> > spaces to achieve virtual SVA (vSVA).
> >
> > PASID management for guest is vendor specific. Some vendors (e.g.
> > Intel
> > VT-d) requires system-wide managed PASIDs cross all devices,
> > regardless
> across?

yep. will correct it.

> > of whether a device is used by host or assigned to guest. Other
> > vendors (e.g. ARM SMMU) may allow PASIDs managed per-device thus could
> > be fully delegated to the guest for assigned devices.
> >
> > For system-wide managed PASIDs, this patch introduces a vfio module to
> > handle explicit PASID alloc/free requests from guest. Allocated PASIDs
> > are associated to a process (or, mm_struct) in IOASID core. A vfio_mm
> > object is introduced to track mm_struct. Multiple VFIO containers
> > within a process share the same vfio_mm object.
> >
> > A quota mechanism is provided to prevent malicious user from
> > exhausting available PASIDs. Currently the quota is a global parameter
> > applied to all VFIO devices. In the future per-device quota might be
> > supported too.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Suggested-by: Alex Williamson 
> > Signed-off-by: Liu Yi L 
> > ---
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) address the comments from Alex on the pasid free range support. Added
> >per vfio_mm pasid r-b tree.
> >https://lore.kernel.org/kvm/20200709082751.32074...@x1.home/
> >
> > v3 -> v4:
> > *) fix lock leam in vfio_mm_get_from_task()
> > *) drop pasid_quota field in struct vfio_mm
> > *) vfio_mm_get_from_task() returns ERR_PTR(-ENOTTY) when
> > !CONFIG_VFIO_PASID
> >
> > v1 -> v2:
> > *) added in v2, split from the pasid alloc/free support of v1
> > ---
> >  drivers/vfio/Kconfig  |   5 +
> >  drivers/vfio/Makefile |   1 +
> >  drivers/vfio/vfio_pasid.c | 235
> ++
> >  include/linux/vfio.h  |  28 ++
> >  4 files changed, 269 insertions(+)
> >  create mode 100644 drivers/vfio/vfio_pasid.c
> >
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index
> > fd17db9..3d8a108 100644
> > --- a/drivers/vfio/Kconfig
> > +++ b/drivers/vfio/Kconfig
> > @@ -19,6 +19,11 @@ config VFIO_VIRQFD
> > depends on VFIO && EVENTFD
> > default n
> >
> > +config VFIO_PASID
> > +   tristate
> > +   depends on IOASID && VFIO
> > +   default n
> > +
> >  menuconfig VFIO
> > tristate "VFIO Non-Privileged userspace driver framework"
> > depends on IOMMU_API
> > diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index
> > de67c47..bb836a3 100644
> > --- a/drivers/vfio/Makefile
> > +++ b/drivers/vfio/Makefile
> > @@ -3,6 +3,7 @@ vfio_virqfd-y := virqfd.o
> >
> >  obj-$(CONFIG_VFIO) += vfio.o
> >  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
> > +obj-$(CONFIG_VFIO_PASID) += vfio_pasid.o
> >  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> >  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> >  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o diff --git
> > a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c new file mode
> > 100644 index 000..66e6054e
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_pasid.c
> > @@ -0,0 +1,235 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2020 Intel Corporation.
> > + * Author: Liu Yi L 
> > + *
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#define DRIVER_VERSION  "0.1"
> > +#define DRIVER_AUTHOR   "Liu Yi L "
> > +#define DRIVER_DESC "PASID management for VFIO bus drivers"
> > +
> > +#define VFIO_DEFAULT_PASID_QUOTA   1

RE: [PATCH v5 04/15] vfio/type1: Report iommu nesting info to userspace

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Saturday, July 18, 2020 1:34 AM
> 
> Yi,
> 
> On 7/12/20 1:20 PM, Liu Yi L wrote:
> > This patch exports iommu nesting capability info to user space through
> > VFIO. User space is expected to check this info for supported uAPIs (e.g.
> it is not only to check the supported uAPIs but rather to know which
> callbacks it must call upon vIOMMU events and which features are
> supported by the physical IOMMU.

yes, will refine the description per your comment.

> > PASID alloc/free, bind page table, and cache invalidation) and the vendor
> > specific format information for first level/stage page table that will be
> > bound to.
> >
> > The nesting info is available only after the nesting iommu type is set
> > for a container.
> to NESTED type

you mean "The nesting info is available only after the container is set to
NESTED type", right?

>  Current implementation imposes one limitation - one
> > nesting container should include at most one group. The philosophy of
> > vfio container is having all groups/devices within the container share
> > the same IOMMU context. When vSVA is enabled, one IOMMU context could
> > include one 2nd-level address space and multiple 1st-level address spaces.
> > While the 2nd-level address space is reasonably sharable by multiple
> > groups, blindly sharing 1st-level address spaces across all groups within the
> > container might instead break the guest expectation. In the future sub/
> > super container concept might be introduced to allow partial address space
> > sharing within an IOMMU context. But for now let's go with this restriction
> > by requiring singleton container for using nesting iommu features. Below
> > link has the related discussion about this decision.
> 
> Maybe add a note about SMMU related changes spotted by Jean.

I guess you mean the comments Jean gave on patch 3/15, right? I'll
copy his comments and also add the link below.

https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/

> >
> > https://lkml.org/lkml/2020/5/15/1028
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > ---
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) return struct iommu_nesting_info for
> VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
> >cap is much "cheap", if needs extension in future, just define another 
> > cap.
> >https://lore.kernel.org/kvm/20200708132947.5b7ee...@x1.home/
> >
> > v3 -> v4:
> > *) address comments against v3.
> >
> > v1 -> v2:
> > *) added in v2
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 102
> +++-
> >  include/uapi/linux/vfio.h   |  19 
> >  2 files changed, 109 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > b/drivers/vfio/vfio_iommu_type1.c
> > index 3bd70ff..ed80104 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
> >  "Maximum number of user DMA mappings per container (65535).");
> >
> >  struct vfio_iommu {
> > -   struct list_headdomain_list;
> > -   struct list_headiova_list;
> > -   struct vfio_domain  *external_domain; /* domain for external user */
> > -   struct mutexlock;
> > -   struct rb_root  dma_list;
> > -   struct blocking_notifier_head notifier;
> > -   unsigned intdma_avail;
> > -   uint64_tpgsize_bitmap;
> > -   boolv2;
> > -   boolnesting;
> > -   booldirty_page_tracking;
> > -   boolpinned_page_dirty_scope;
> > +   struct list_headdomain_list;
> > +   struct list_headiova_list;
> > +   /* domain for external user */
> > +   struct vfio_domain  *external_domain;
> > +   struct mutexlock;
> > +   struct rb_root  dma_list;
> > +   struct blocking_notifier_head   notifier;
> > +   unsigned intdma_avail;
> > +   uint64_tpgsize_bitmap;
> > +   boolv2;
> > +   boolnesting;
> > +   booldirty_page_tracking;
> > +   boo

RE: [PATCH v5 03/15] iommu/smmu: Report empty domain nesting info

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> 
> Yi,
> 
> On 7/12/20 1:20 PM, Liu Yi L wrote:
> > This patch is added as instead of returning a boolean for
> > DOMAIN_ATTR_NESTING,
> > iommu_domain_get_attr() should return an iommu_nesting_info handle.
> 
> you may add in the commit message you return an empty nesting info struct for 
> now
> as true nesting is not yet supported by the SMMUs.

will do.

> Besides:
> Reviewed-by: Eric Auger 

thanks.

Regards,
Yi Liu

> Thanks
> 
> Eric
> >
> > Cc: Will Deacon 
> > Cc: Robin Murphy 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Suggested-by: Jean-Philippe Brucker 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > ---
> >  drivers/iommu/arm-smmu-v3.c | 29 +++--
> >  drivers/iommu/arm-smmu.c| 29 +++--
> >  2 files changed, 54 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index f578677..ec815d7 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -3019,6 +3019,32 @@ static struct iommu_group
> *arm_smmu_device_group(struct device *dev)
> > return group;
> >  }
> >
> > +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain
> *smmu_domain,
> > +   void *data)
> > +{
> > +   struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
> > +   unsigned int size;
> > +
> > +   if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> > +   return -ENODEV;
> > +
> > +   size = sizeof(struct iommu_nesting_info);
> > +
> > +   /*
> > +* if provided buffer size is smaller than expected, should
> > +* return 0 and also the expected buffer size to caller.
> > +*/
> > +   if (info->size < size) {
> > +   info->size = size;
> > +   return 0;
> > +   }
> > +
> > +   /* report an empty iommu_nesting_info for now */
> > +   memset(info, 0x0, size);
> > +   info->size = size;
> > +   return 0;
> > +}
> > +
> >  static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
> > enum iommu_attr attr, void *data)  { @@ -
> 3028,8 +3054,7 @@
> > static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
> > case IOMMU_DOMAIN_UNMANAGED:
> > switch (attr) {
> > case DOMAIN_ATTR_NESTING:
> > -   *(int *)data = (smmu_domain->stage ==
> ARM_SMMU_DOMAIN_NESTED);
> > -   return 0;
> > +   return arm_smmu_domain_nesting_info(smmu_domain,
> data);
> > default:
> > return -ENODEV;
> > }
> > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index
> > 243bc4c..09e2f1b 100644
> > --- a/drivers/iommu/arm-smmu.c
> > +++ b/drivers/iommu/arm-smmu.c
> > @@ -1506,6 +1506,32 @@ static struct iommu_group
> *arm_smmu_device_group(struct device *dev)
> > return group;
> >  }
> >
> > +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain
> *smmu_domain,
> > +   void *data)
> > +{
> > +   struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
> > +   unsigned int size;
> > +
> > +   if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> > +   return -ENODEV;
> > +
> > +   size = sizeof(struct iommu_nesting_info);
> > +
> > +   /*
> > +* if provided buffer size is smaller than expected, should
> > +* return 0 and also the expected buffer size to caller.
> > +*/
> > +   if (info->size < size) {
> > +   info->size = size;
> > +   return 0;
> > +   }
> > +
> > +   /* report an empty iommu_nesting_info for now */
> > +   memset(info, 0x0, size);
> > +   info->size = size;
> > +   return 0;
> > +}
> > +
> >  static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
> > enum iommu_attr attr, void *data)  { @@ -
> 1515,8 +1541,7 @@
> > static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
> > case IOMMU_DOMAIN_UNMANAGED:
> > switch (attr) {
> > case DOMAIN_ATTR_NESTING:
> > -   *(int *)data = (smmu_domain->stage ==
> ARM_SMMU_DOMAIN_NESTED);
> > -   return 0;
> > +   return arm_smmu_domain_nesting_info(smmu_domain,
> data);
> > default:
> > return -ENODEV;
> > }
> >
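For reference, the two-call size negotiation implemented by the quoted arm_smmu_domain_nesting_info() can be sketched in plain C as below. All names here are illustrative stand-ins rather than the kernel's actual symbols; only the control flow mirrors the patch:

```c
#include <assert.h>
#include <string.h>

/* Stand-in for struct iommu_nesting_info (illustrative only). */
struct nesting_info {
	unsigned int size;	/* in: buffer size, out: required size */
	unsigned int features;
};

/*
 * Mimics the driver side: if the provided buffer is smaller than
 * required, report the expected size back and return 0; otherwise
 * fill an empty info, as the SMMU patch does for now.
 */
static int get_nesting_info(struct nesting_info *info)
{
	unsigned int size = sizeof(struct nesting_info);

	if (!info)
		return -19;			/* -ENODEV */
	if (info->size < size) {
		info->size = size;	/* tell the caller what to allocate */
		return 0;
	}
	memset(info, 0, size);		/* empty nesting info for now */
	info->size = size;
	return 0;
}

/* Caller side: first call learns the size, second call gets the data. */
static int negotiate_demo(void)
{
	struct nesting_info info = { .size = 0 };

	if (get_nesting_info(&info) != 0)
		return 0;
	if (info.size != sizeof(struct nesting_info))
		return 0;
	if (get_nesting_info(&info) != 0)	/* buffer now large enough */
		return 0;
	return info.features == 0;		/* empty info reported */
}
```

The same pattern appears verbatim in both the SMMUv3 and SMMU hunks of the patch, which is why the question below about vfio_iommu_type1.c applies to both.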



RE: [PATCH v5 02/15] iommu: Report domain nesting info

2020-07-20 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Saturday, July 18, 2020 12:29 AM
> 
> Hi Yi,
> 
> On 7/12/20 1:20 PM, Liu Yi L wrote:
> > IOMMUs that support nesting translation needs report the capability info
> s/needs/need to report

yep.

> > to userspace, e.g. the format of first level/stage paging structures.
> It gives information about requirements the userspace needs to implement
> plus other features characterizing the physical implementation.

got it. will add it in next version.

> >
> > This patch reports nesting info by DOMAIN_ATTR_NESTING. Caller can get
> > nesting info after setting DOMAIN_ATTR_NESTING.
> I guess you meant after selecting VFIO_TYPE1_NESTING_IOMMU?

yes, it is. Perhaps it's better to say the nesting info can be fetched after
selecting VFIO_TYPE1_NESTING_IOMMU.

> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> > v4 -> v5:
> > *) address comments from Eric Auger.
> >
> > v3 -> v4:
> > *) split the SMMU driver changes to be a separate patch
> > *) move the @addr_width and @pasid_bits from vendor specific
> >part to generic part.
> > *) tweak the description for the @features field of struct
> >iommu_nesting_info.
> > *) add description on the @data[] field of struct iommu_nesting_info
> >
> > v2 -> v3:
> > *) remvoe cap/ecap_mask in iommu_nesting_info.
> > *) remove cap/ecap_mask in iommu_nesting_info.
> > *) reuse DOMAIN_ATTR_NESTING to get nesting info.
> > *) return an empty iommu_nesting_info for SMMU drivers per Jean'
> >suggestion.
> > ---
> >  include/uapi/linux/iommu.h | 77
> ++
> >  1 file changed, 77 insertions(+)
> >
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 1afc661..d2a47c4 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -332,4 +332,81 @@ struct iommu_gpasid_bind_data {
> > } vendor;
> >  };
> >
> > +/*
> > + * struct iommu_nesting_info - Information for nesting-capable IOMMU.
> > + *user space should check it before using
> > + *nesting capability.
> > + *
> > + * @size:  size of the whole structure
> > + * @format:PASID table entry format, the same definition as struct
> > + * iommu_gpasid_bind_data @format.
> > + * @features:  supported nesting features.
> > + * @flags: currently reserved for future extension.
> > + * @addr_width:The output addr width of first level/stage translation
> > + * @pasid_bits:Maximum supported PASID bits, 0 represents no PASID
> > + * support.
> > + * @data:  vendor specific cap info. data[] structure type can be deduced
> > + * from @format field.
> > + *
> > + *
> > + * +===+==+
> > + * | feature   |  Notes   |
> > + * +===+==+
> > + * | SYSWIDE_PASID |  PASIDs are managed in system-wide, instead of per   |
> s/in system-wide/system-wide ?

got it.

> > + * |   |  device. When a device is assigned to userspace or   |
> > + * |   |  VM, proper uAPI (userspace driver framework uAPI,   |
> > + * |   |  e.g. VFIO) must be used to allocate/free PASIDs for |
> > + * |   |  the assigned device.
> Isn't it possible to be more explicit, something like:
>   |
> System-wide PASID management is mandated by the physical IOMMU. All
> PASID allocation must be mediated through the TBD API.

yep, I can add it.

> > + * +---+--+
> > + * | BIND_PGTBL|  The owner of the first level/stage page table must  |
> > + * |   |  explicitly bind the page table to associated PASID  |
> > + * |   |  (either the one specified in bind request or the|
> > + * |   |  default PASID of iommu domain), through userspace   |
> > + * |   |  driver framework uAPI (e.g. VFIO_IOMMU_NESTING_OP). |
> As per your answer in https://lkml.org/lkml/2020/7/6/383, I now
> understand ARM would not expose that BIND_PGTBL nesting feature,

yes, that's my point.

> I still
> think the above wording is a bit confusing. Maybe you may explicitly
> talk about the PASID *entr

RE: [PATCH v6 01/12] iommu: Change type of pasid to u32

2020-07-14 Thread Liu, Yi L
> From: Yu, Fenghua 
> Sent: Tuesday, July 14, 2020 9:55 PM
> On Mon, Jul 13, 2020 at 07:45:49PM -0700, Liu, Yi L wrote:
> > > From: Fenghua Yu 
> > > Sent: Tuesday, July 14, 2020 7:48 AM
> > >
> > > PASID is defined as a few different types in iommu including "int",
> > > "u32", and "unsigned int". To be consistent and to match with uapi
> > > definitions, define PASID and its variations (e.g. max PASID) as "u32".
> > > "u32" is also shorter and a little more explicit than "unsigned int".
> > >
> > > No PASID type change in uapi although it defines PASID as __u64 in
> > > some places.
> >
> > just out of curiosity, why not use ioasid_t? In the Linux kernel, PASIDs are
> > managed by ioasid.
> 
> ioasid_t is only used in a few underlying files (ioasid.c and ioasid.h).
> Instead of changing hundreds of places to use ioasid_t, it's better to keep
> ioasid_t used only in those files.
> 
> And it's explicit and matches with uapi to define PASID as u32. Changing to
> ioasid_t in so many files (amd, gpu, crypto, etc.) may confuse upper users on
> "why ioasid_t".
> 
> So we had better explicitly define PASID as u32 and keep ioasid_t in those
> limited underlying files.

fair enough, thanks,

Regards,
Yi Liu

> Thanks.
> 
> -Fenghua


RE: [PATCH v5 03/15] iommu/smmu: Report empty domain nesting info

2020-07-14 Thread Liu, Yi L
Hi Will,

> From: Will Deacon 
> Sent: Monday, July 13, 2020 9:15 PM
> 
> On Sun, Jul 12, 2020 at 04:20:58AM -0700, Liu Yi L wrote:
> > This patch is added as instead of returning a boolean for 
> > DOMAIN_ATTR_NESTING,
> > iommu_domain_get_attr() should return an iommu_nesting_info handle.
> >
> > Cc: Will Deacon 
> > Cc: Robin Murphy 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Suggested-by: Jean-Philippe Brucker 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > ---
> >  drivers/iommu/arm-smmu-v3.c | 29 +++--
> >  drivers/iommu/arm-smmu.c| 29 +++--
> >  2 files changed, 54 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index f578677..ec815d7 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -3019,6 +3019,32 @@ static struct iommu_group
> *arm_smmu_device_group(struct device *dev)
> > return group;
> >  }
> >
> > +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain
> *smmu_domain,
> > +   void *data)
> > +{
> > +   struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
> > +   unsigned int size;
> > +
> > +   if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> > +   return -ENODEV;
> > +
> > +   size = sizeof(struct iommu_nesting_info);
> > +
> > +   /*
> > +* if provided buffer size is smaller than expected, should
> > +* return 0 and also the expected buffer size to caller.
> > +*/
> > +   if (info->size < size) {
> > +   info->size = size;
> > +   return 0;
> > +   }
> > +
> > +   /* report an empty iommu_nesting_info for now */
> > +   memset(info, 0x0, size);
> > +   info->size = size;
> > +   return 0;
> > +}
> 
> Have you verified that this doesn't break the existing usage of
> DOMAIN_ATTR_NESTING in drivers/vfio/vfio_iommu_type1.c?

I don't have an ARM machine on hand, but I contacted Jean-Philippe and he
confirmed there is no compiling issue. I didn't see any code getting the
DOMAIN_ATTR_NESTING attr in the current drivers/vfio/vfio_iommu_type1.c.
What I'm adding is a call to iommu_domain_get_attr(, DOMAIN_ATTR_NESTING),
which won't fail as long as iommu_domain_get_attr() returns 0. This patch
returns an empty nesting info for DOMAIN_ATTR_NESTING with a return value
of 0 on success, so I guess it won't break nesting for ARM.

@Eric, what's your opinion? Your dual-stage vSMMU support may
also share the vfio_iommu_type1.c code.

Regards,
Yi Liu

> Will


RE: [PATCH v6 03/12] docs: x86: Add documentation for SVA (Shared Virtual Addressing)

2020-07-13 Thread Liu, Yi L
> From: Fenghua Yu 
> Sent: Tuesday, July 14, 2020 7:48 AM
> 
> From: Ashok Raj 
> 
> ENQCMD and Data Streaming Accelerator (DSA) and all of their associated 
> features
> are a complicated stack with lots of interconnected pieces.
> This documentation provides a big picture overview for all of the features.
> 
> Signed-off-by: Ashok Raj 
> Co-developed-by: Fenghua Yu 
> Signed-off-by: Fenghua Yu 
> Reviewed-by: Tony Luck 
> ---
> v3:
> - Replace deprecated intel_svm_bind_mm() by iommu_sva_bind_mm() (Baolu)
> - Fix a couple of typos (Baolu)
> 
> v2:
> - Fix the doc format and add the doc in toctree (Thomas)
> - Modify the doc for better description (Thomas, Tony, Dave)
> 
>  Documentation/x86/index.rst |   1 +
>  Documentation/x86/sva.rst   | 287 
>  2 files changed, 288 insertions(+)
>  create mode 100644 Documentation/x86/sva.rst
> 
> diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index
> 265d9e9a093b..e5d5ff096685 100644
> --- a/Documentation/x86/index.rst
> +++ b/Documentation/x86/index.rst
> @@ -30,3 +30,4 @@ x86-specific Documentation
> usb-legacy-support
> i386/index
> x86_64/index
> +   sva
> diff --git a/Documentation/x86/sva.rst b/Documentation/x86/sva.rst new file 
> mode
> 100644 index ..7242a84169ef
> --- /dev/null
> +++ b/Documentation/x86/sva.rst
> @@ -0,0 +1,287 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===
> +Shared Virtual Addressing (SVA) with ENQCMD
> +===
> +
> +Background
> +==
> +
> +Shared Virtual Addressing (SVA) allows the processor and device to use
> +the same virtual addresses avoiding the need for software to translate
> +virtual addresses to physical addresses. SVA is what PCIe calls Shared
> +Virtual Memory (SVM)
> +
> +In addition to the convenience of using application virtual addresses
> +by the device, it also doesn't require pinning pages for DMA.
> +PCIe Address Translation Services (ATS) along with Page Request
> +Interface
> +(PRI) allow devices to function much the same way as the CPU handling
> +application page-faults. For more information please refer to PCIe
> +specification Chapter 10: ATS Specification.
> +

nit: it may be helpful to mention that Chapter 10 is part of the PCIe spec
since 4.0; before that, ATS had its own specification.

> +Use of SVA requires IOMMU support in the platform. IOMMU also is
> +required to support PCIe features ATS and PRI. ATS allows devices to
> +cache translations for the virtual address. IOMMU driver uses the
> +mmu_notifier() support to keep the device tlb cache and the CPU cache
> +in sync. PRI allows the device to request paging the virtual address
> +before using if they are not paged in the CPU page tables.
> +
> +
> +Shared Hardware Workqueues
> +==
> +
> +Unlike Single Root I/O Virtualization (SRIOV), Scalable IOV (SIOV)
> +permits the use of Shared Work Queues (SWQ) by both applications and
> +Virtual Machines (VM's). This allows better hardware utilization vs.
> +hard partitioning resources that could result in under utilization. In
> +order to allow the hardware to distinguish the context for which work
> +is being executed in the hardware by SWQ interface, SIOV uses Process
> +Address Space ID (PASID), which is a 20bit number defined by the PCIe SIG.
> +
> +PASID value is encoded in all transactions from the device. This allows
> +the IOMMU to track I/O on a per-PASID granularity in addition to using
> +the PCIe Resource Identifier (RID) which is the Bus/Device/Function.
> +
> +
> +ENQCMD
> +==
> +
> +ENQCMD is a new instruction on Intel platforms that atomically submits
> +a work descriptor to a device. The descriptor includes the operation to
> +be performed, virtual addresses of all parameters, virtual address of a
> +completion record, and the PASID (process address space ID) of the current 
> process.
> +
> +ENQCMD works with non-posted semantics and carries a status back if the
> +command was accepted by hardware. This allows the submitter to know if
> +the submission needs to be retried or other device specific mechanisms
> +to implement fairness or ensure forward progress can be made.
> +
> +ENQCMD is the glue that ensures applications can directly submit
> +commands to the hardware and also permit hardware to be aware of
> +application context to perform I/O operations via use of PASID.
> +

maybe a reader will ask about ENQCMDS after reading the ENQCMD/S spec. :-)

> +Process Address Space Tagging
> +=
> +
> +A new thread scoped MSR (IA32_PASID) provides the connection between
> +user processes and the rest of the hardware. When an application first
> +accesses an SVA capable device this MSR is initialized with a newly
> +allocated PASID. The driver for the device calls an IOMMU specific api
> +that sets up the routing for DMA and page-requests.
> +
> +For example, the Intel Data Streaming 

RE: [PATCH v6 01/12] iommu: Change type of pasid to u32

2020-07-13 Thread Liu, Yi L
> From: Fenghua Yu 
> Sent: Tuesday, July 14, 2020 7:48 AM
> 
> PASID is defined as a few different types in iommu including "int",
> "u32", and "unsigned int". To be consistent and to match with uapi
> definitions, define PASID and its variations (e.g. max PASID) as "u32".
> "u32" is also shorter and a little more explicit than "unsigned int".
> 
> No PASID type change in uapi although it defines PASID as __u64 in
> some places.

just out of curiosity, why not use ioasid_t? In the Linux kernel, PASIDs are
managed by ioasid.

Regards,
Yi Liu

> Suggested-by: Thomas Gleixner 
> Signed-off-by: Fenghua Yu 
> Reviewed-by: Tony Luck 
> Reviewed-by: Lu Baolu 
> Acked-by: Felix Kuehling 
> ---
> v6:
> - Change return type to u32 for kfd_pasid_alloc() (Felix)
> 
> v5:
> - Reviewed by Lu Baolu
> 
> v4:
> - Change PASID type from "unsigned int" to "u32" (Christoph)
> 
> v2:
> - Create this new patch to define PASID as "unsigned int" consistently in
>   iommu (Thomas)
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  4 +--
>  .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c|  2 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v7.c |  2 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v8.c |  2 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c |  2 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.h |  2 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  4 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   |  6 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ids.h   |  4 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c   |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c|  8 ++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h|  8 ++---
>  .../gpu/drm/amd/amdkfd/cik_event_interrupt.c  |  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c   |  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_dbgmgr.h   |  2 +-
>  .../drm/amd/amdkfd/kfd_device_queue_manager.c |  7 ++---
>  drivers/gpu/drm/amd/amdkfd/kfd_events.c   |  8 ++---
>  drivers/gpu/drm/amd/amdkfd/kfd_events.h   |  4 +--
>  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c|  6 ++--
>  drivers/gpu/drm/amd/amdkfd/kfd_pasid.c|  4 +--
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 20 ++--
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c  |  2 +-
>  .../gpu/drm/amd/include/kgd_kfd_interface.h   |  2 +-
>  drivers/iommu/amd/amd_iommu.h | 10 +++---
>  drivers/iommu/amd/iommu.c | 31 ++-
>  drivers/iommu/amd/iommu_v2.c  | 20 ++--
>  drivers/iommu/intel/dmar.c|  7 +++--
>  drivers/iommu/intel/intel-pasid.h | 24 +++---
>  drivers/iommu/intel/iommu.c   |  4 +--
>  drivers/iommu/intel/pasid.c   | 31 +--
>  drivers/iommu/intel/svm.c | 12 +++
>  drivers/iommu/iommu.c |  2 +-
>  drivers/misc/uacce/uacce.c|  2 +-
>  include/linux/amd-iommu.h |  8 ++---
>  include/linux/intel-iommu.h   | 12 +++
>  include/linux/intel-svm.h |  2 +-
>  include/linux/iommu.h | 10 +++---
>  include/linux/uacce.h |  2 +-
>  38 files changed, 141 insertions(+), 141 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index ffe149aafc39..dfef5a7e0f5a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -207,11 +207,11 @@ uint8_t amdgpu_amdkfd_get_xgmi_hops_count(struct
> kgd_dev *dst, struct kgd_dev *s
>   })
> 
>  /* GPUVM API */
> -int amdgpu_amdkfd_gpuvm_create_process_vm(struct kgd_dev *kgd, unsigned int
> pasid,
> +int amdgpu_amdkfd_gpuvm_create_process_vm(struct kgd_dev *kgd, u32 pasid,
>   void **vm, void **process_info,
>   struct dma_fence **ef);
>  int amdgpu_amdkfd_gpuvm_acquire_process_vm(struct kgd_dev *kgd,
> - struct file *filp, unsigned int pasid,
> + struct file *filp, u32 pasid,
>   void **vm, void **process_info,
>   struct dma_fence **ef);
>  void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
> index bf927f432506..ee531c3988d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
> @@ -105,7 +105,7 @@ static void kgd_program_sh_mem_settings(struct kgd_dev
> *kgd, uint32_t vmid,
>   unlock_srbm(kgd);
>  }
> 
> -static int kgd_set_pasid_vmid_mapping(struct kgd_dev *kgd, unsigned int 
> pasid,
> +static int kgd_set_pasid_vmid_mapping(struct kgd_dev *kgd, u32 pasid,
> 

[PATCH v5 02/15] iommu: Report domain nesting info

2020-07-12 Thread Liu Yi L
IOMMUs that support nesting translation needs report the capability info
to userspace, e.g. the format of first level/stage paging structures.

This patch reports nesting info by DOMAIN_ATTR_NESTING. Caller can get
nesting info after setting DOMAIN_ATTR_NESTING.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
v4 -> v5:
*) address comments from Eric Auger.

v3 -> v4:
*) split the SMMU driver changes to be a separate patch
*) move the @addr_width and @pasid_bits from vendor specific
   part to generic part.
*) tweak the description for the @features field of struct
   iommu_nesting_info.
*) add description on the @data[] field of struct iommu_nesting_info

v2 -> v3:
*) remove cap/ecap_mask in iommu_nesting_info.
*) reuse DOMAIN_ATTR_NESTING to get nesting info.
*) return an empty iommu_nesting_info for SMMU drivers per Jean'
   suggestion.
---
 include/uapi/linux/iommu.h | 77 ++
 1 file changed, 77 insertions(+)

diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 1afc661..d2a47c4 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -332,4 +332,81 @@ struct iommu_gpasid_bind_data {
} vendor;
 };
 
+/*
+ * struct iommu_nesting_info - Information for nesting-capable IOMMU.
+ *user space should check it before using
+ *nesting capability.
+ *
+ * @size:  size of the whole structure
+ * @format:PASID table entry format, the same definition as struct
+ * iommu_gpasid_bind_data @format.
+ * @features:  supported nesting features.
+ * @flags: currently reserved for future extension.
+ * @addr_width:The output addr width of first level/stage translation
+ * @pasid_bits:Maximum supported PASID bits, 0 represents no PASID
+ * support.
+ * @data:  vendor specific cap info. data[] structure type can be deduced
+ * from @format field.
+ *
+ * +===+==+
+ * | feature   |  Notes   |
+ * +===+==+
+ * | SYSWIDE_PASID |  PASIDs are managed in system-wide, instead of per   |
+ * |   |  device. When a device is assigned to userspace or   |
+ * |   |  VM, proper uAPI (userspace driver framework uAPI,   |
+ * |   |  e.g. VFIO) must be used to allocate/free PASIDs for |
+ * |   |  the assigned device.|
+ * +---+--+
+ * | BIND_PGTBL|  The owner of the first level/stage page table must  |
+ * |   |  explicitly bind the page table to associated PASID  |
+ * |   |  (either the one specified in bind request or the|
+ * |   |  default PASID of iommu domain), through userspace   |
+ * |   |  driver framework uAPI (e.g. VFIO_IOMMU_NESTING_OP). |
+ * +---+--+
+ * | CACHE_INVLD   |  The owner of the first level/stage page table must  |
+ * |   |  explicitly invalidate the IOMMU cache through uAPI  |
+ * |   |  provided by userspace driver framework (e.g. VFIO)  |
+ * |   |  according to vendor-specific requirement when   |
+ * |   |  changing the page table.|
+ * +---+--+
+ *
+ * @data[] types defined for @format:
+ * ++=+
+ * | @format| @data[] |
+ * ++=+
+ * | IOMMU_PASID_FORMAT_INTEL_VTD   | struct iommu_nesting_info_vtd   |
+ * ++-+
+ *
+ */
+struct iommu_nesting_info {
+   __u32   size;
+   __u32   format;
+#define IOMMU_NESTING_FEAT_SYSWIDE_PASID   (1 << 0)
+#define IOMMU_NESTING_FEAT_BIND_PGTBL  (1 << 1)
+#define IOMMU_NESTING_FEAT_CACHE_INVLD (1 << 2)
+   __u32   features;
+   __u32   flags;
+   __u16   addr_width;
+   __u16   pasid_bits;
+   __u32   padding;
+   __u8data[];
+};
+
+/*
+ * struct iommu_nesting_info_vtd - Intel VT-d specific nesting info
+ *
+ * @flags: VT-d specific flags. Currently reserved for future
+ * extension.
+ * @cap_reg:   Describe basic capabilities as defined in VT-d capability
+ * register.
+ * @ecap_reg:  Describe the extended capabilities as defined in VT-d
+ * extended capability register.
+ */
+st
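To make the feature table in this uAPI header concrete, here is a hedged sketch of how a userspace consumer might test the @features bits. The bit values copy the definitions from the patch, while the helper and its policy (requiring all three features for full vSVA) are assumptions for illustration only:

```c
#include <assert.h>

/* Feature bits as defined in the patch's uAPI header. */
#define IOMMU_NESTING_FEAT_SYSWIDE_PASID	(1 << 0)
#define IOMMU_NESTING_FEAT_BIND_PGTBL		(1 << 1)
#define IOMMU_NESTING_FEAT_CACHE_INVLD		(1 << 2)

/*
 * Illustrative policy (an assumption, not from the patch): treat
 * nesting as usable for vSVA only when system-wide PASID allocation,
 * page-table bind, and cache invalidation are all reported.
 */
static int nesting_usable_for_vsva(unsigned int features)
{
	unsigned int required = IOMMU_NESTING_FEAT_SYSWIDE_PASID |
				IOMMU_NESTING_FEAT_BIND_PGTBL |
				IOMMU_NESTING_FEAT_CACHE_INVLD;

	return (features & required) == required;
}
```

A caller would read struct iommu_nesting_info via the VFIO capability chain first, then pass info->features to a check like this before attempting bind or invalidation calls.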

[PATCH v5 06/15] iommu/vt-d: Support setting ioasid set to domain

2020-07-12 Thread Liu Yi L
From the IOMMU's p.o.v., PASIDs allocated and managed by external components
(e.g. VFIO) will be passed in for gpasid_bind/unbind operations. The IOMMU
needs some knowledge to check PASID ownership, hence add an interface for
those components to tell the IOMMU the PASID owner.

In the latest kernel design, PASID ownership is managed by the IOASID set
the PASID is allocated from. This patch adds support for setting the ioasid
set ID on the domains used for nesting/vSVA. Subsequent SVA operations on
the PASID will be checked against its IOASID set for proper ownership.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
v4 -> v5:
*) address comments from Eric Auger.
---
 drivers/iommu/intel/iommu.c | 22 ++
 include/linux/intel-iommu.h |  4 
 include/linux/iommu.h   |  1 +
 3 files changed, 27 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 72ae6a2..4d54198 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1793,6 +1793,7 @@ static struct dmar_domain *alloc_domain(int flags)
if (first_level_by_default())
domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
domain->has_iotlb_device = false;
+   domain->ioasid_sid = INVALID_IOASID_SET;
INIT_LIST_HEAD(>devices);
 
return domain;
@@ -6039,6 +6040,27 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
}
spin_unlock_irqrestore(_domain_lock, flags);
break;
+   case DOMAIN_ATTR_IOASID_SID:
+   {
+   int sid = *(int *)data;
+
+   if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE)) {
+   ret = -ENODEV;
+   break;
+   }
+   spin_lock_irqsave(&device_domain_lock, flags);
+   if (dmar_domain->ioasid_sid != INVALID_IOASID_SET &&
+   dmar_domain->ioasid_sid != sid) {
+   pr_warn_ratelimited("multi ioasid_set (%d:%d) setting",
+   dmar_domain->ioasid_sid, sid);
+   ret = -EBUSY;
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+   break;
+   }
+   dmar_domain->ioasid_sid = sid;
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+   break;
+   }
default:
ret = -EINVAL;
break;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 3f23c26..0d0ab32 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -549,6 +549,10 @@ struct dmar_domain {
   2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
u64 max_addr;   /* maximum mapped address */
 
+   int ioasid_sid; /*
+* the ioasid set which tracks all
+* PASIDs used by the domain.
+*/
int default_pasid;  /*
 * The default pasid used for non-SVM
 * traffic on mediated devices.
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 7ca9d48..e84a1d5 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -124,6 +124,7 @@ enum iommu_attr {
DOMAIN_ATTR_FSL_PAMUV1,
DOMAIN_ATTR_NESTING,/* two stages of translation */
DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
+   DOMAIN_ATTR_IOASID_SID,
DOMAIN_ATTR_MAX,
 };
 
-- 
2.7.4



[PATCH v5 09/15] iommu/vt-d: Check ownership for PASIDs from user-space

2020-07-12 Thread Liu Yi L
When an IOMMU domain with the nesting attribute is used for guest SVA, a
system-wide PASID is allocated for binding with the device and the domain.
For security reasons, we need to check the PASID passed from user-space,
e.g. for page table bind/unbind and PASID-related cache invalidation.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
 drivers/iommu/intel/iommu.c | 10 ++
 drivers/iommu/intel/svm.c   |  7 +--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 4d54198..a9504cb 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5436,6 +5436,7 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, 
struct device *dev,
int granu = 0;
u64 pasid = 0;
u64 addr = 0;
+   void *pdata;
 
granu = to_vtd_granularity(cache_type, inv_info->granularity);
if (granu == -EINVAL) {
@@ -5456,6 +5457,15 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, 
struct device *dev,
 (inv_info->granu.addr_info.flags & 
IOMMU_INV_ADDR_FLAGS_PASID))
pasid = inv_info->granu.addr_info.pasid;
 
+   pdata = ioasid_find(dmar_domain->ioasid_sid, pasid, NULL);
+   if (!pdata) {
+   ret = -EINVAL;
+   goto out_unlock;
+   } else if (IS_ERR(pdata)) {
+   ret = PTR_ERR(pdata);
+   goto out_unlock;
+   }
+
switch (BIT(cache_type)) {
case IOMMU_CACHE_INV_TYPE_IOTLB:
/* HW will ignore LSB bits based on address mask */
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index d2c0e1a..212dee0 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -319,7 +319,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, 
struct device *dev,
dmar_domain = to_dmar_domain(domain);
 
mutex_lock(&pasid_mutex);
-   svm = ioasid_find(INVALID_IOASID_SET, data->hpasid, NULL);
+   svm = ioasid_find(dmar_domain->ioasid_sid, data->hpasid, NULL);
if (IS_ERR(svm)) {
ret = PTR_ERR(svm);
goto out;
@@ -436,6 +436,7 @@ int intel_svm_unbind_gpasid(struct iommu_domain *domain,
struct device *dev, ioasid_t pasid)
 {
struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
+   struct dmar_domain *dmar_domain;
struct intel_svm_dev *sdev;
struct intel_svm *svm;
int ret = -EINVAL;
@@ -443,8 +444,10 @@ int intel_svm_unbind_gpasid(struct iommu_domain *domain,
if (WARN_ON(!iommu))
return -EINVAL;
 
+   dmar_domain = to_dmar_domain(domain);
+
mutex_lock(&pasid_mutex);
-   svm = ioasid_find(INVALID_IOASID_SET, pasid, NULL);
+   svm = ioasid_find(dmar_domain->ioasid_sid, pasid, NULL);
if (!svm) {
ret = -EINVAL;
goto out;
-- 
2.7.4



[PATCH v5 07/15] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)

2020-07-12 Thread Liu Yi L
This patch allows user space to request PASID allocation/free, e.g. when
serving such requests from the guest.

PASIDs that are not freed by userspace are automatically freed when the
IOASID set is destroyed on process exit.
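The alloc/free semantics above, together with the single request structure covering a PASID range (see the v1 -> v2 notes below), can be sketched with a toy per-process allocator. This is an illustrative model only, not the vfio_pasid module:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PASID 64

/* toy per-process IOASID set: used[i] => PASID i is allocated */
struct toy_set { bool used[MAX_PASID]; };

/* allocate the lowest free PASID in [min, max]; -1 if exhausted */
static int toy_pasid_alloc(struct toy_set *s, int min, int max)
{
    for (int p = min; p <= max && p < MAX_PASID; p++)
        if (!s->used[p]) { s->used[p] = true; return p; }
    return -1;
}

/* free every allocated PASID in [min, max] -- a range free, matching
 * the single alloc/free request structure of the uAPI */
static void toy_pasid_free_range(struct toy_set *s, int min, int max)
{
    for (int p = min; p <= max && p < MAX_PASID; p++)
        s->used[p] = false;
}

/* destroying the set (process exit) releases everything that
 * userspace forgot to free */
static void toy_set_destroy(struct toy_set *s)
{
    toy_pasid_free_range(s, 0, MAX_PASID - 1);
}
```

The destroy path is what guarantees no PASID leaks past the lifetime of the process that allocated it.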

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Yi Sun 
Signed-off-by: Jacob Pan 
---
v4 -> v5:
*) address comments from Eric Auger.
*) the comments for the PASID_FREE request is addressed in patch 5/15 of
   this series.

v3 -> v4:
*) address comments from v3, except the below comment against the range
   of PASID_FREE request. needs more help on it.
"> +if (req.range.min > req.range.max)

Is it exploitable that a user can spin the kernel for a long time in
the case of a free by calling this with [0, MAX_UINT] regardless of
their actual allocations?"
https://lore.kernel.org/linux-iommu/20200702151832.048b4...@x1.home/
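One possible mitigation for the [0, UINT_MAX] spin concern quoted above is to clamp the requested free range to the bounds of what the set has actually allocated before walking it. This is only a suggestion sketched in user-space C, not what the patch currently does; the struct and helper names are hypothetical:

```c
#include <assert.h>

/* toy view of an IOASID set that tracks its allocated bounds */
struct toy_set { unsigned int lowest, highest; int count; };

/* clamp a user-supplied [min, max] free request to the PASIDs the
 * set actually holds, so the walk is bounded by real allocations
 * rather than by the size of the user's range */
static int clamp_free_range(const struct toy_set *s,
                            unsigned int *min, unsigned int *max)
{
    if (!s->count)
        return -1;              /* nothing allocated: no work to do */
    if (*min < s->lowest)
        *min = s->lowest;
    if (*max > s->highest)
        *max = s->highest;
    return (*min <= *max) ? 0 : -1;
}
```

With such a clamp, a hostile [0, UINT_MAX] request degenerates to walking only the caller's own allocations.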

v1 -> v2:
*) move the vfio_mm related code to a separate module
*) use a single structure for alloc/free, could support a range of PASIDs
*) fetch vfio_mm at group_attach time instead of at iommu driver open time
---
 drivers/vfio/Kconfig|  1 +
 drivers/vfio/vfio_iommu_type1.c | 85 +
 drivers/vfio/vfio_pasid.c   | 10 +
 include/linux/vfio.h|  6 +++
 include/uapi/linux/vfio.h   | 37 ++
 5 files changed, 139 insertions(+)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 3d8a108..95d90c6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -2,6 +2,7 @@
 config VFIO_IOMMU_TYPE1
tristate
depends on VFIO
+   select VFIO_PASID if (X86)
default n
 
 config VFIO_IOMMU_SPAPR_TCE
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ed80104..55b4065 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -76,6 +76,7 @@ struct vfio_iommu {
booldirty_page_tracking;
boolpinned_page_dirty_scope;
struct iommu_nesting_info   *nesting_info;
+   struct vfio_mm  *vmm;
 };
 
 struct vfio_domain {
@@ -1937,6 +1938,11 @@ static void vfio_iommu_iova_insert_copy(struct 
vfio_iommu *iommu,
 
 static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
 {
+   if (iommu->vmm) {
+   vfio_mm_put(iommu->vmm);
+   iommu->vmm = NULL;
+   }
+
kfree(iommu->nesting_info);
iommu->nesting_info = NULL;
 }
@@ -2071,6 +2077,26 @@ static int vfio_iommu_type1_attach_group(void 
*iommu_data,
iommu->nesting_info);
if (ret)
goto out_detach;
+
+   if (iommu->nesting_info->features &
+   IOMMU_NESTING_FEAT_SYSWIDE_PASID) {
+   struct vfio_mm *vmm;
+   int sid;
+
+   vmm = vfio_mm_get_from_task(current);
+   if (IS_ERR(vmm)) {
+   ret = PTR_ERR(vmm);
+   goto out_detach;
+   }
+   iommu->vmm = vmm;
+
+   sid = vfio_mm_ioasid_sid(vmm);
+   ret = iommu_domain_set_attr(domain->domain,
+   DOMAIN_ATTR_IOASID_SID,
+   &sid);
+   if (ret)
+   goto out_detach;
+   }
}
 
/* Get aperture info */
@@ -2855,6 +2881,63 @@ static int vfio_iommu_type1_dirty_pages(struct 
vfio_iommu *iommu,
return -EINVAL;
 }
 
+static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
+   unsigned int min,
+   unsigned int max)
+{
+   int ret = -EOPNOTSUPP;
+
+   mutex_lock(&iommu->lock);
+   if (iommu->vmm)
+   ret = vfio_pasid_alloc(iommu->vmm, min, max);
+   mutex_unlock(&iommu->lock);
+   return ret;
+}
+
+static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
+  unsigned int min,
+  unsigned int max)
+{
+   int ret = -EOPNOTSUPP;
+
+   mutex_lock(&iommu->lock);
+   if (iommu->vmm) {
+   vfio_pasid_free_range(iommu->vmm, min, max);
+   ret = 0;
+   }
+   mutex_unlock(&iommu->lock);
+   return ret;
+}
+
+static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu,
+ unsigned long arg)
+{
+   struct vfio_iommu_type1_pasid_request req;
+   unsigned long minsz;
+
+   minsz = offs

[PATCH v5 03/15] iommu/smmu: Report empty domain nesting info

2020-07-12 Thread Liu Yi L
This patch is added because, instead of returning a boolean for DOMAIN_ATTR_NESTING,
iommu_domain_get_attr() should return an iommu_nesting_info handle.
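The get_attr path in this series uses a two-call size negotiation: the caller fills in `info->size`; if the buffer is too small, the callee writes back the required size and returns 0, so the caller can reallocate and retry. A user-space model of that handshake (the toy struct only mirrors the idea of `struct iommu_nesting_info`; field names are illustrative):

```c
#include <assert.h>
#include <string.h>

/* toy version of a size-first info struct: size field, then payload */
struct toy_nesting_info {
    unsigned int size;
    unsigned int features;
    char data[32];
};

/* callee side: if the caller's buffer is smaller than needed,
 * report the needed size and return 0 without filling the payload */
static int toy_get_nesting_info(struct toy_nesting_info *info)
{
    const unsigned int need = sizeof(struct toy_nesting_info);

    if (info->size < need) {
        info->size = need;      /* tell the caller how much to allocate */
        return 0;
    }
    memset(info, 0, need);      /* "empty" report, as in this patch */
    info->size = need;
    return 0;
}

/* caller side: probe with a deliberately tiny buffer, read back
 * the required size, then call again with a full-sized buffer */
static unsigned int toy_query_required_size(void)
{
    struct toy_nesting_info probe;

    probe.size = sizeof(probe.size);   /* too small on purpose */
    toy_get_nesting_info(&probe);
    return probe.size;
}
```

Returning 0 (not -ENOSPC) on the short-buffer probe is what lets the same call serve as both a size query and a data fetch.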

Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Suggested-by: Jean-Philippe Brucker 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
v4 -> v5:
*) address comments from Eric Auger.
---
 drivers/iommu/arm-smmu-v3.c | 29 +++--
 drivers/iommu/arm-smmu.c| 29 +++--
 2 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677..ec815d7 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -3019,6 +3019,32 @@ static struct iommu_group *arm_smmu_device_group(struct 
device *dev)
return group;
 }
 
+static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
+   void *data)
+{
+   struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
+   unsigned int size;
+
+   if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   return -ENODEV;
+
+   size = sizeof(struct iommu_nesting_info);
+
+   /*
+* if provided buffer size is smaller than expected, should
+* return 0 and also the expected buffer size to caller.
+*/
+   if (info->size < size) {
+   info->size = size;
+   return 0;
+   }
+
+   /* report an empty iommu_nesting_info for now */
+   memset(info, 0x0, size);
+   info->size = size;
+   return 0;
+}
+
 static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
enum iommu_attr attr, void *data)
 {
@@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain 
*domain,
case IOMMU_DOMAIN_UNMANAGED:
switch (attr) {
case DOMAIN_ATTR_NESTING:
-   *(int *)data = (smmu_domain->stage == 
ARM_SMMU_DOMAIN_NESTED);
-   return 0;
+   return arm_smmu_domain_nesting_info(smmu_domain, data);
default:
return -ENODEV;
}
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 243bc4c..09e2f1b 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1506,6 +1506,32 @@ static struct iommu_group *arm_smmu_device_group(struct 
device *dev)
return group;
 }
 
+static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
+   void *data)
+{
+   struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
+   unsigned int size;
+
+   if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+   return -ENODEV;
+
+   size = sizeof(struct iommu_nesting_info);
+
+   /*
+* if provided buffer size is smaller than expected, should
+* return 0 and also the expected buffer size to caller.
+*/
+   if (info->size < size) {
+   info->size = size;
+   return 0;
+   }
+
+   /* report an empty iommu_nesting_info for now */
+   memset(info, 0x0, size);
+   info->size = size;
+   return 0;
+}
+
 static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
enum iommu_attr attr, void *data)
 {
@@ -1515,8 +1541,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain 
*domain,
case IOMMU_DOMAIN_UNMANAGED:
switch (attr) {
case DOMAIN_ATTR_NESTING:
-   *(int *)data = (smmu_domain->stage == 
ARM_SMMU_DOMAIN_NESTED);
-   return 0;
+   return arm_smmu_domain_nesting_info(smmu_domain, data);
default:
return -ENODEV;
}
-- 
2.7.4



[PATCH v5 11/15] vfio/type1: Allow invalidating first-level/stage IOMMU cache

2020-07-12 Thread Liu Yi L
This patch provides an interface allowing userspace to invalidate the
IOMMU cache for the first-level page table. It is required when the
first-level IOMMU page table is not managed by the host kernel, as in
the nested translation setup.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Eric Auger 
Signed-off-by: Jacob Pan 
---
v1 -> v2:
*) rename from "vfio/type1: Flush stage-1 IOMMU cache for nesting type"
*) rename vfio_cache_inv_fn() to vfio_dev_cache_invalidate_fn()
*) vfio_dev_cache_inv_fn() always successful
*) remove VFIO_IOMMU_CACHE_INVALIDATE, and reuse VFIO_IOMMU_NESTING_OP
---
 drivers/vfio/vfio_iommu_type1.c | 50 +
 include/uapi/linux/vfio.h   |  3 +++
 2 files changed, 53 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index f0f21ff..960cc59 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3073,6 +3073,53 @@ static long vfio_iommu_handle_pgtbl_op(struct vfio_iommu 
*iommu,
return ret;
 }
 
+static int vfio_dev_cache_invalidate_fn(struct device *dev, void *data)
+{
+   struct domain_capsule *dc = (struct domain_capsule *)data;
+   unsigned long arg = *(unsigned long *)dc->data;
+
+   iommu_cache_invalidate(dc->domain, dev, (void __user *)arg);
+   return 0;
+}
+
+static long vfio_iommu_invalidate_cache(struct vfio_iommu *iommu,
+   unsigned long arg)
+{
+   struct domain_capsule dc = { .data = &arg };
+   struct vfio_group *group;
+   struct vfio_domain *domain;
+   int ret = 0;
+   struct iommu_nesting_info *info;
+
+   mutex_lock(&iommu->lock);
+   /*
+* Cache invalidation is required for any nesting IOMMU,
+* so no need to check system-wide PASID support.
+*/
+   info = iommu->nesting_info;
+   if (!info || !(info->features & IOMMU_NESTING_FEAT_CACHE_INVLD)) {
+   ret = -EOPNOTSUPP;
+   goto out_unlock;
+   }
+
+   group = vfio_find_nesting_group(iommu);
+   if (!group) {
+   ret = -EINVAL;
+   goto out_unlock;
+   }
+
+   domain = list_first_entry(&iommu->domain_list,
+ struct vfio_domain, next);
+   dc.group = group;
+   dc.domain = domain->domain;
+   iommu_group_for_each_dev(group->iommu_group, &dc,
+vfio_dev_cache_invalidate_fn);
+
+out_unlock:
+   mutex_unlock(&iommu->lock);
+   return ret;
+}
+
 static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu,
unsigned long arg)
 {
@@ -3095,6 +3142,9 @@ static long vfio_iommu_type1_nesting_op(struct vfio_iommu 
*iommu,
case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL:
ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz);
break;
+   case VFIO_IOMMU_NESTING_OP_CACHE_INVLD:
+   ret = vfio_iommu_invalidate_cache(iommu, arg + minsz);
+   break;
default:
ret = -EINVAL;
}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index a8ad786..845a5800 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1225,6 +1225,8 @@ struct vfio_iommu_type1_pasid_request {
  * +-+---+
  * | UNBIND_PGTBL|  struct iommu_gpasid_bind_data|
  * +-+---+
+ * | CACHE_INVLD |  struct iommu_cache_invalidate_info   |
+ * +-+---+
  *
  * returns: 0 on success, -errno on failure.
  */
@@ -1237,6 +1239,7 @@ struct vfio_iommu_type1_nesting_op {
 
 #define VFIO_IOMMU_NESTING_OP_BIND_PGTBL   (0)
 #define VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL (1)
+#define VFIO_IOMMU_NESTING_OP_CACHE_INVLD  (2)
 
 #define VFIO_IOMMU_NESTING_OP  _IO(VFIO_TYPE, VFIO_BASE + 19)
 
-- 
2.7.4



[PATCH v5 12/15] vfio/type1: Add vSVA support for IOMMU-backed mdevs

2020-07-12 Thread Liu Yi L
In recent years, the mediated device pass-through framework (e.g. vfio-mdev)
has been used to achieve flexible device sharing across domains (e.g. VMs).
There are also hardware-assisted mediated pass-through solutions from
platform vendors, e.g. Intel VT-d scalable mode, which supports the Intel
Scalable I/O Virtualization technology. Such mdevs are called IOMMU-
backed mdevs, as there is IOMMU-enforced DMA isolation for them.
In the kernel, IOMMU-backed mdevs are exposed to the IOMMU layer via the
aux-domain concept, which means mdevs are protected by an iommu domain that
is auxiliary to the domain the kernel driver primarily uses for the DMA
API. Details can be found in the KVM Forum presentation below:

https://events19.linuxfoundation.org/wp-content/uploads/2017/12/\
Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf

This patch extends NESTING_IOMMU ops to IOMMU-backed mdev devices. The
main requirement is to use the auxiliary domain associated with mdev.

Cc: Kevin Tian 
CC: Jacob Pan 
CC: Jun Tian 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
---
v1 -> v2:
*) check the iommu_device to ensure the handling mdev is IOMMU-backed
---
 drivers/vfio/vfio_iommu_type1.c | 39 +++
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 960cc59..f1f1ae2 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2373,20 +2373,41 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu 
*iommu,
return ret;
 }
 
+static struct device *vfio_get_iommu_device(struct vfio_group *group,
+   struct device *dev)
+{
+   if (group->mdev_group)
+   return vfio_mdev_get_iommu_device(dev);
+   else
+   return dev;
+}
+
 static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
 {
struct domain_capsule *dc = (struct domain_capsule *)data;
unsigned long arg = *(unsigned long *)dc->data;
+   struct device *iommu_device;
+
+   iommu_device = vfio_get_iommu_device(dc->group, dev);
+   if (!iommu_device)
+   return -EINVAL;
 
-   return iommu_sva_bind_gpasid(dc->domain, dev, (void __user *)arg);
+   return iommu_sva_bind_gpasid(dc->domain, iommu_device,
+(void __user *)arg);
 }
 
 static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
 {
struct domain_capsule *dc = (struct domain_capsule *)data;
unsigned long arg = *(unsigned long *)dc->data;
+   struct device *iommu_device;
 
-   iommu_sva_unbind_gpasid(dc->domain, dev, (void __user *)arg);
+   iommu_device = vfio_get_iommu_device(dc->group, dev);
+   if (!iommu_device)
+   return -EINVAL;
+
+   iommu_sva_unbind_gpasid(dc->domain, iommu_device,
+   (void __user *)arg);
return 0;
 }
 
@@ -2395,8 +2416,13 @@ static int __vfio_dev_unbind_gpasid_fn(struct device 
*dev, void *data)
struct domain_capsule *dc = (struct domain_capsule *)data;
struct iommu_gpasid_bind_data *unbind_data =
(struct iommu_gpasid_bind_data *)dc->data;
+   struct device *iommu_device;
+
+   iommu_device = vfio_get_iommu_device(dc->group, dev);
+   if (!iommu_device)
+   return -EINVAL;
 
-   __iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data);
+   __iommu_sva_unbind_gpasid(dc->domain, iommu_device, unbind_data);
return 0;
 }
 
@@ -3077,8 +3103,13 @@ static int vfio_dev_cache_invalidate_fn(struct device 
*dev, void *data)
 {
struct domain_capsule *dc = (struct domain_capsule *)data;
unsigned long arg = *(unsigned long *)dc->data;
+   struct device *iommu_device;
+
+   iommu_device = vfio_get_iommu_device(dc->group, dev);
+   if (!iommu_device)
+   return -EINVAL;
 
-   iommu_cache_invalidate(dc->domain, dev, (void __user *)arg);
+   iommu_cache_invalidate(dc->domain, iommu_device, (void __user *)arg);
return 0;
 }
 
-- 
2.7.4



[PATCH v5 13/15] vfio/pci: Expose PCIe PASID capability to guest

2020-07-12 Thread Liu Yi L
This patch exposes the PCIe PASID capability to the guest for assigned
devices. The existing vfio_pci driver hides it from the guest by setting
the capability length to 0 in pci_ext_cap_length[].

This patch only exposes the PASID capability for devices that have the
PCIe PASID extended structure in their configuration space. VFs will
therefore not see the PASID capability, as a VF does not implement the
PASID extended structure in its configuration space; supporting VFs is a
TODO for the future. Related discussion can be found at the link below:

https://lkml.org/lkml/2020/4/7/693

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
---
v1 -> v2:
*) added in v2, but it was sent in a separate patchseries before
---
 drivers/vfio/pci/vfio_pci_config.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index d98843f..07ff2e6 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -95,7 +95,7 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = 
{
[PCI_EXT_CAP_ID_LTR]=   PCI_EXT_CAP_LTR_SIZEOF,
[PCI_EXT_CAP_ID_SECPCI] =   0,  /* not yet */
[PCI_EXT_CAP_ID_PMUX]   =   0,  /* not yet */
-   [PCI_EXT_CAP_ID_PASID]  =   0,  /* not yet */
+   [PCI_EXT_CAP_ID_PASID]  =   PCI_EXT_CAP_PASID_SIZEOF,
 };
 
 /*
-- 
2.7.4



[PATCH v5 15/15] iommu/vt-d: Support reporting nesting capability info

2020-07-12 Thread Liu Yi L
Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
Signed-off-by: Jacob Pan 
---
v2 -> v3:
*) remove cap/ecap_mask in iommu_nesting_info.
---
 drivers/iommu/intel/iommu.c | 81 +++--
 include/linux/intel-iommu.h | 16 +
 2 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index a9504cb..9f7ad1a 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5659,12 +5659,16 @@ static inline bool iommu_pasid_support(void)
 static inline bool nested_mode_support(void)
 {
struct dmar_drhd_unit *drhd;
-   struct intel_iommu *iommu;
+   struct intel_iommu *iommu, *prev = NULL;
bool ret = true;
 
rcu_read_lock();
for_each_active_iommu(iommu, drhd) {
-   if (!sm_supported(iommu) || !ecap_nest(iommu->ecap)) {
+   if (!prev)
+   prev = iommu;
+   if (!sm_supported(iommu) || !ecap_nest(iommu->ecap) ||
+   (VTD_CAP_MASK & (iommu->cap ^ prev->cap)) ||
+   (VTD_ECAP_MASK & (iommu->ecap ^ prev->ecap))) {
ret = false;
break;
}
@@ -6079,6 +6083,78 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
return ret;
 }
 
+static int intel_iommu_get_nesting_info(struct iommu_domain *domain,
+   struct iommu_nesting_info *info)
+{
+   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+   u64 cap = VTD_CAP_MASK, ecap = VTD_ECAP_MASK;
+   struct device_domain_info *domain_info;
+   struct iommu_nesting_info_vtd vtd;
+   unsigned long flags;
+   unsigned int size;
+
+   if (domain->type != IOMMU_DOMAIN_UNMANAGED ||
+   !(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE))
+   return -ENODEV;
+
+   if (!info)
+   return -EINVAL;
+
+   size = sizeof(struct iommu_nesting_info) +
+   sizeof(struct iommu_nesting_info_vtd);
+   /*
+* if provided buffer size is smaller than expected, should
+* return 0 and also the expected buffer size to caller.
+*/
+   if (info->size < size) {
+   info->size = size;
+   return 0;
+   }
+
+   spin_lock_irqsave(&device_domain_lock, flags);
+   /*
+* arbitrarily select the first domain_info, as all nesting
+* related capabilities should be consistent across iommu
+* units.
+*/
+   domain_info = list_first_entry(&dmar_domain->devices,
+  struct device_domain_info, link);
+   cap &= domain_info->iommu->cap;
+   ecap &= domain_info->iommu->ecap;
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+
+   info->format = IOMMU_PASID_FORMAT_INTEL_VTD;
+   info->features = IOMMU_NESTING_FEAT_SYSWIDE_PASID |
+IOMMU_NESTING_FEAT_BIND_PGTBL |
+IOMMU_NESTING_FEAT_CACHE_INVLD;
+   info->addr_width = dmar_domain->gaw;
+   info->pasid_bits = ilog2(intel_pasid_max_id);
+   info->padding = 0;
+   vtd.flags = 0;
+   vtd.padding = 0;
+   vtd.cap_reg = cap;
+   vtd.ecap_reg = ecap;
+
+   memcpy(info->data, &vtd, sizeof(vtd));
+   return 0;
+}
+
+static int intel_iommu_domain_get_attr(struct iommu_domain *domain,
+  enum iommu_attr attr, void *data)
+{
+   switch (attr) {
+   case DOMAIN_ATTR_NESTING:
+   {
+   struct iommu_nesting_info *info =
+   (struct iommu_nesting_info *)data;
+
+   return intel_iommu_get_nesting_info(domain, info);
+   }
+   default:
+   return -ENODEV;
+   }
+}
+
 /*
  * Check that the device does not live on an external facing PCI port that is
  * marked as untrusted. Such devices should not be able to apply quirks and
@@ -6101,6 +6177,7 @@ const struct iommu_ops intel_iommu_ops = {
.domain_alloc   = intel_iommu_domain_alloc,
.domain_free= intel_iommu_domain_free,
.domain_set_attr= intel_iommu_domain_set_attr,
+   .domain_get_attr= intel_iommu_domain_get_attr,
.attach_dev = intel_iommu_attach_device,
.detach_dev = intel_iommu_detach_device,
.aux_attach_dev = intel_iommu_aux_attach_device,
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 18f292e..c4ed0d4 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -197,6 +197,22 @@
 #define ecap_max_handle_mask(e) ((e >> 20) & 0xf)
 #define ecap_sc_support(e) ((e >> 7) & 0x1) /* Snoopi

[PATCH v5 01/15] vfio/type1: Refactor vfio_iommu_type1_ioctl()

2020-07-12 Thread Liu Yi L
This patch refactors vfio_iommu_type1_ioctl() to use a switch instead of
if-else, and gives each command its own helper function.

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Reviewed-by: Eric Auger 
Suggested-by: Christoph Hellwig 
Signed-off-by: Liu Yi L 
---
v4 -> v5:
*) address comments from Eric Auger, add r-b from Eric.
---
 drivers/vfio/vfio_iommu_type1.c | 394 ++--
 1 file changed, 213 insertions(+), 181 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5e556ac..3bd70ff 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2453,6 +2453,23 @@ static int vfio_domains_have_iommu_cache(struct 
vfio_iommu *iommu)
return ret;
 }
 
+static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
+   unsigned long arg)
+{
+   switch (arg) {
+   case VFIO_TYPE1_IOMMU:
+   case VFIO_TYPE1v2_IOMMU:
+   case VFIO_TYPE1_NESTING_IOMMU:
+   return 1;
+   case VFIO_DMA_CC_IOMMU:
+   if (!iommu)
+   return 0;
+   return vfio_domains_have_iommu_cache(iommu);
+   default:
+   return 0;
+   }
+}
+
 static int vfio_iommu_iova_add_cap(struct vfio_info_cap *caps,
 struct vfio_iommu_type1_info_cap_iova_range *cap_iovas,
 size_t size)
@@ -2529,241 +2546,256 @@ static int vfio_iommu_migration_build_caps(struct 
vfio_iommu *iommu,
return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig));
 }
 
-static long vfio_iommu_type1_ioctl(void *iommu_data,
-  unsigned int cmd, unsigned long arg)
+static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
+unsigned long arg)
 {
-   struct vfio_iommu *iommu = iommu_data;
+   struct vfio_iommu_type1_info info;
unsigned long minsz;
+   struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+   unsigned long capsz;
+   int ret;
 
-   if (cmd == VFIO_CHECK_EXTENSION) {
-   switch (arg) {
-   case VFIO_TYPE1_IOMMU:
-   case VFIO_TYPE1v2_IOMMU:
-   case VFIO_TYPE1_NESTING_IOMMU:
-   return 1;
-   case VFIO_DMA_CC_IOMMU:
-   if (!iommu)
-   return 0;
-   return vfio_domains_have_iommu_cache(iommu);
-   default:
-   return 0;
-   }
-   } else if (cmd == VFIO_IOMMU_GET_INFO) {
-   struct vfio_iommu_type1_info info;
-   struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
-   unsigned long capsz;
-   int ret;
-
-   minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+   minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
 
-   /* For backward compatibility, cannot require this */
-   capsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);
+   /* For backward compatibility, cannot require this */
+   capsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);
 
-   if (copy_from_user(&info, (void __user *)arg, minsz))
-   return -EFAULT;
+   if (copy_from_user(&info, (void __user *)arg, minsz))
+   return -EFAULT;
 
-   if (info.argsz < minsz)
-   return -EINVAL;
+   if (info.argsz < minsz)
+   return -EINVAL;
 
-   if (info.argsz >= capsz) {
-   minsz = capsz;
-   info.cap_offset = 0; /* output, no-recopy necessary */
-   }
+   if (info.argsz >= capsz) {
+   minsz = capsz;
+   info.cap_offset = 0; /* output, no-recopy necessary */
+   }
 
-   mutex_lock(&iommu->lock);
-   info.flags = VFIO_IOMMU_INFO_PGSIZES;
+   mutex_lock(&iommu->lock);
+   info.flags = VFIO_IOMMU_INFO_PGSIZES;
 
-   info.iova_pgsizes = iommu->pgsize_bitmap;
+   info.iova_pgsizes = iommu->pgsize_bitmap;
 
-   ret = vfio_iommu_migration_build_caps(iommu, &caps);
+   ret = vfio_iommu_migration_build_caps(iommu, &caps);
 
-   if (!ret)
-   ret = vfio_iommu_iova_build_caps(iommu, &caps);
+   if (!ret)
+   ret = vfio_iommu_iova_build_caps(iommu, &caps);
 
-   mutex_unlock(&iommu->lock);
+   mutex_unlock(&iommu->lock);
 
-   if (ret)
-   return ret;
+   if (ret)
+   return ret;
 
-   if (caps.size) {
-   info.flags |= VFIO_IOMMU_INFO_CAPS;
+   if (caps.size) {
+   info.flags |= VFIO_IOMMU_INFO_CAPS;
 
-   if (info

[PATCH v5 04/15] vfio/type1: Report iommu nesting info to userspace

2020-07-12 Thread Liu Yi L
This patch exports iommu nesting capability info to user space through
VFIO. User space is expected to check this info for the supported uAPIs (e.g.
PASID alloc/free, bind page table, and cache invalidation) and for the
vendor-specific format information of the first-level/stage page table
that will be bound.

The nesting info is available only after the nesting iommu type is set
for a container. The current implementation imposes one limitation: a
nesting container should include at most one group. The philosophy of a
vfio container is that all groups/devices within the container share
the same IOMMU context. When vSVA is enabled, one IOMMU context may
include one 2nd-level address space and multiple 1st-level address spaces.
While the 2nd-level address space is reasonably sharable by multiple groups,
blindly sharing 1st-level address spaces across all groups within the
container might break guest expectations. In the future a sub/super
container concept might be introduced to allow partial address-space
sharing within an IOMMU context. For now we go with this restriction
by requiring a singleton container for using the nesting iommu features.
The link below has the related discussion behind this decision.

https://lkml.org/lkml/2020/5/15/1028
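User space discovers the nesting info through the VFIO capability chain of VFIO_IOMMU_GET_INFO: each capability header carries an id and the byte offset of the next header, with 0 terminating the chain. A minimal model of the walk (the local struct mirrors the shape of `struct vfio_info_cap_header`; offsets are within a flat byte buffer):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* mirrors the shape of struct vfio_info_cap_header in <linux/vfio.h> */
struct cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;      /* byte offset of the next cap, 0 = end of chain */
};

/* walk the chain starting at cap_offset and return the offset of the
 * capability with the wanted id, or 0 if it is absent */
static uint32_t find_cap(const uint8_t *buf, uint32_t cap_offset,
                         uint16_t wanted_id)
{
    while (cap_offset) {
        struct cap_header hdr;

        memcpy(&hdr, buf + cap_offset, sizeof(hdr));
        if (hdr.id == wanted_id)
            return cap_offset;
        cap_offset = hdr.next;
    }
    return 0;
}
```

In the real uAPI the first offset comes from `cap_offset` in struct vfio_iommu_type1_info; the payload following the header at the found offset is the nesting info.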

Cc: Kevin Tian 
CC: Jacob Pan 
Cc: Alex Williamson 
Cc: Eric Auger 
Cc: Jean-Philippe Brucker 
Cc: Joerg Roedel 
Cc: Lu Baolu 
Signed-off-by: Liu Yi L 
---
v4 -> v5:
*) address comments from Eric Auger.
*) return struct iommu_nesting_info for VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
   cap is much "cheap", if needs extension in future, just define another cap.
   https://lore.kernel.org/kvm/20200708132947.5b7ee...@x1.home/

v3 -> v4:
*) address comments against v3.

v1 -> v2:
*) added in v2
---
 drivers/vfio/vfio_iommu_type1.c | 102 +++-
 include/uapi/linux/vfio.h   |  19 
 2 files changed, 109 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3bd70ff..ed80104 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
 "Maximum number of user DMA mappings per container (65535).");
 
 struct vfio_iommu {
-   struct list_headdomain_list;
-   struct list_headiova_list;
-   struct vfio_domain  *external_domain; /* domain for external user */
-   struct mutexlock;
-   struct rb_root  dma_list;
-   struct blocking_notifier_head notifier;
-   unsigned intdma_avail;
-   uint64_tpgsize_bitmap;
-   boolv2;
-   boolnesting;
-   booldirty_page_tracking;
-   boolpinned_page_dirty_scope;
+   struct list_headdomain_list;
+   struct list_headiova_list;
+   /* domain for external user */
+   struct vfio_domain  *external_domain;
+   struct mutexlock;
+   struct rb_root  dma_list;
+   struct blocking_notifier_head   notifier;
+   unsigned intdma_avail;
+   uint64_tpgsize_bitmap;
+   boolv2;
+   boolnesting;
+   booldirty_page_tracking;
+   boolpinned_page_dirty_scope;
+   struct iommu_nesting_info   *nesting_info;
 };
 
 struct vfio_domain {
@@ -130,6 +132,9 @@ struct vfio_regions {
 #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)\
(!list_empty(&(iommu)->domain_list))
 
+#define CONTAINER_HAS_DOMAIN(iommu)(((iommu)->external_domain) || \
+(!list_empty(&(iommu)->domain_list)))
+
 #define DIRTY_BITMAP_BYTES(n)  (ALIGN(n, BITS_PER_TYPE(u64)) / BITS_PER_BYTE)
 
 /*
@@ -1929,6 +1934,13 @@ static void vfio_iommu_iova_insert_copy(struct 
vfio_iommu *iommu,
 
list_splice_tail(iova_copy, iova);
 }
+
+static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
+{
+   kfree(iommu->nesting_info);
+   iommu->nesting_info = NULL;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 struct iommu_group *iommu_group)
 {
@@ -1959,6 +1971,12 @@ static int vfio_iommu_type1_attach_group(void 
*iommu_data,
}
}
 
+   /* Nesting type container can include only one group */
+   if (iommu->nesting && CONTAINER_HAS_DOMAIN(iommu)) {
+   mutex_unlock(&iommu->lock);
+   return -EINVAL;
+   }
+
group = kzalloc(sizeof(*group), GFP_KERNEL);
domain = kzalloc(sizeof(*domain), GFP_KERNEL);
if (!group || !domain) {
@@ -2029,6 +2047,32
