RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-08 Thread Tian, Kevin
> From: Raj, Ashok 
> Sent: Friday, May 7, 2021 12:33 AM
> 
> > Basically it means when the guest's top level IOASID is created for
> > nesting that IOASID claims all PASID's on the RID and excludes any
> > PASID IOASIDs from existing on the RID now or in future.
> 
> The way to look at this is as follows:
> 
> For platforms that do not have a need to support shared work queue model
> support for ENQCMD or similar, PASID space is naturally per RID. There is no
> complication with this. Every RID has the full range of PASID's and no need
> for host to track which PASIDs are allocated now or in future in the guest.
> 
> For platforms that support ENQCMD, it is required to mandate PASIDs are
> global across the entire system. Maybe its better to call them gPASID for
> guest and hPASID for host. Short reason being gPASID->hPASID is a guest
> wide mapping for ENQCMD and not a per-RID based mapping. (We covered
> that
> in earlier responses)
> 
> In our current implementation we actually don't separate this space, and
> gPASID == hPASID. The iommu driver enforces that by using the custom
> allocator and the architected interface that allows all guest vIOMMU
> allocations to be proxied to host. Nothing but a glorified hypercall like
> interface. In fact some OS's do use hypercall to get a hPASID vs using
> the vCMD style interface.
> 

After more thinking about the new interface, I feel gPASID==hPASID 
actually causes some confusion in the uAPI design. In concept an ioasid
is not active until it's attached to a device, because it's just an ID
if w/o a device. So supposedly an ioasid should reject all user commands
before attach. However, a guest likely asks for a new gPASID before
attaching it to devices and the vIOMMU. If gPASID==hPASID then Qemu 
must request /dev/ioasid to allocate a hw_id for an ioasid which hasn't 
been attached to any device, relying on the kernel-internal knowledge 
that this hw_id comes from a global allocator w/o dependency on any 
device. This doesn't sound like a clean design, not to mention that it 
also conflicts with live migration.

I'd like to hear your and Jason's opinions on an alternative option that 
removes this restriction, thus allowing gPASID!=hPASID.

gPASID!=hPASID has a problem when assigning a physical device which 
supports both shared work queue (ENQCMD with PASID in MSR) 
and dedicated work queue (PASID in device register) to a guest
process which is associated with a gPASID. Say the host kernel has set up
the hPASID entry with nested translation through /dev/ioasid. For a
shared work queue the CPU is configured to translate the gPASID in the MSR 
into the **hPASID** before the payload goes out to the wire. However, 
for a dedicated work queue the device MMIO register is directly mapped 
to and programmed by the guest, thus containing a **gPASID** value,
implying DMA requests through this interface will hit IOMMU faults
due to an invalid gPASID entry. Having gPASID==hPASID is a simple 
workaround here. mdev doesn't have this problem because the
PASID register is on the emulated control path and thus can be translated
to the hPASID by the mdev driver.
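To make the failure mode concrete, here is a toy Python model of the two submission paths (purely illustrative, not kernel code; the table layout and names are made up):

```python
# The IOMMU PASID table for this RID is keyed by hPASID only.
pasid_table = {}                    # hPASID -> paging structure (illustrative)
gpasid_to_hpasid = {100: 2000}      # per-VM translation used by ENQCMD

pasid_table[2000] = "guest_page_table"   # host sets up the hPASID entry

def submit_shared_wq(gpasid):
    # ENQCMD path: CPU translates gPASID -> hPASID before the payload
    # goes out to the wire, so the valid hPASID entry is hit.
    hpasid = gpasid_to_hpasid[gpasid]
    return pasid_table.get(hpasid)

def submit_dedicated_wq(gpasid):
    # Dedicated WQ path: the guest programs the raw gPASID into the
    # device MMIO register; it goes out untranslated and misses.
    return pasid_table.get(gpasid)

assert submit_shared_wq(100) == "guest_page_table"
assert submit_dedicated_wq(100) is None   # models the IOMMU fault
```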

Along this line, one possible option is having both gPASID and hPASID 
entries point to the same paging structure, sort of making gPASID
an aliasing hw_id to hPASID. Then we also need to make sure the gPASID
range does not collide with the hPASID range for this RID. Something like
below:

In the beginning Qemu specifies a minimum ID (say 1024) beyond which 
hPASIDs must be allocated (sort of delegating [0, 1023] of this RID's
PASID space to userspace):
ioctl(ioasid_fd, SET_IOASID_MIN_HWID, 1024);

The guest still uses the vIOMMU interface or a hypercall to allocate gPASIDs.
Upon such a request, Qemu returns a gPASID from [0, 1023] to the guest 
and also allocates a new ioasid from /dev/ioasid. There is no hw_id 
allocated at this step:
ioasid = ioctl(ioasid_fd, ALLOC_IOASID);

The hw_id (hPASID) is allocated when attaching the ioasid to the said device:
ioctl(device_fd, VFIO_ATTACH_IOASID, ioasid);

Then the gPASID is provided as an aliasing hw_id for this ioasid:
ioctl(device_fd, VFIO_ALIASING_IOASID, ioasid, gPASID);

Starting from this point the kernel should make sure that any ioasid
operation is applied to both the gPASID and the hPASID for this RID 
(entry setup, teardown, etc.) and that both PASID entries point to the same
paging structures. When a page fault happens, the IOMMU driver 
should also be able to link a fault from either PASID back to the associated ioasid. 
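The whole aliasing scheme can be sketched with a toy Python model (all names here are made up for illustration, not real uAPI):

```python
# Toy model of the proposed per-RID aliasing scheme.
class RidPasidSpace:
    def __init__(self, min_hwid=1024):
        self.min_hwid = min_hwid      # set via SET_IOASID_MIN_HWID
        self.next_hwid = min_hwid     # kernel allocates hPASIDs from here
        self.entries = {}             # PASID entry -> paging structure

    def attach(self, pgtable):
        # hPASID is allocated at VFIO_ATTACH_IOASID time, beyond min_hwid
        hpasid = self.next_hwid
        self.next_hwid += 1
        self.entries[hpasid] = pgtable
        return hpasid

    def alias(self, gpasid, hpasid):
        # VFIO_ALIASING_IOASID: gPASID comes from the delegated [0, min) range
        assert 0 <= gpasid < self.min_hwid
        self.entries[gpasid] = self.entries[hpasid]

    def tear_down(self, gpasid, hpasid):
        # any ioasid operation applies to both entries
        del self.entries[gpasid]
        del self.entries[hpasid]

rid = RidPasidSpace(1024)
pgtable = object()                    # stands in for the paging structure
hpasid = rid.attach(pgtable)
rid.alias(5, hpasid)                  # Qemu handed gPASID 5 to the guest
assert hpasid >= 1024
assert rid.entries[5] is rid.entries[hpasid]   # same paging structure
```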

As explained earlier this aliasing requirement only applies to physical
devices on Intel platforms. We may either have mdev ignore such 
aliasing requests, or have the vfio device report whether aliasing is allowed.

This is sort of a hybrid model. The gPASID range is reserved locally
in the per-RID PASID space and delegated to userspace, while the hPASIDs 
are still managed globally and not exposed to userspace.

Does this sound like a cleaner approach (still w/ some complexity) compared
to the restriction of having gPASID==hPASID?

Thanks
Kevin

Re: [RFC PATCH v4 01/13] iommu: Introduce dirty log tracking framework

2021-05-08 Thread Keqian Zhu
Hi Baolu,

On 2021/5/8 11:46, Lu Baolu wrote:
> Hi Keqian,
> 
> On 5/7/21 6:21 PM, Keqian Zhu wrote:
>> Some types of IOMMU are capable of tracking DMA dirty log, such as
>> ARM SMMU with HTTU or Intel IOMMU with SLADE. This introduces the
>> dirty log tracking framework in the IOMMU base layer.
>>
>> Four new essential interfaces are added, and we maintain the status
>> of dirty log tracking in iommu_domain.
>> 1. iommu_support_dirty_log: Check whether domain supports dirty log tracking
>> 2. iommu_switch_dirty_log: Perform actions to start|stop dirty log tracking
>> 3. iommu_sync_dirty_log: Sync dirty log from IOMMU into a dirty bitmap
>> 4. iommu_clear_dirty_log: Clear dirty log of IOMMU by a mask bitmap
>>
>> Note: Don't concurrently call these interfaces with other ops that
>> access underlying page table.
>>
>> Signed-off-by: Keqian Zhu 
>> Signed-off-by: Kunkun Jiang 
>> ---
>>   drivers/iommu/iommu.c| 201 +++
>>   include/linux/iommu.h|  63 +++
>>   include/trace/events/iommu.h |  63 +++
>>   3 files changed, 327 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 808ab70d5df5..0d15620d1e90 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -1940,6 +1940,7 @@ static struct iommu_domain 
>> *__iommu_domain_alloc(struct bus_type *bus,
>>   domain->type = type;
>>   /* Assume all sizes by default; the driver may override this later */
>>   domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;
>> +mutex_init(&domain->switch_log_lock);
>> return domain;
>>   }
>> @@ -2703,6 +2704,206 @@ int iommu_set_pgtable_quirks(struct iommu_domain 
>> *domain,
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_set_pgtable_quirks);
>>   +bool iommu_support_dirty_log(struct iommu_domain *domain)
>> +{
>> +const struct iommu_ops *ops = domain->ops;
>> +
>> +return ops->support_dirty_log && ops->support_dirty_log(domain);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_support_dirty_log);
> 
> I suppose this interface is to ask the vendor IOMMU driver to check
> whether each device/iommu in the domain supports dirty bit tracking.
> But what will happen if new devices with different tracking capability
> are added afterward?
Yep, this is considered in the vfio part. We will query again after attaching or
detaching devices from the domain. When the domain becomes capable, we enable
the dirty log for it. When it is no longer capable, we disable the dirty log for it.

> 
> To make things simple, is it possible to support this tracking only when
> all underlying IOMMUs support dirty bit tracking?
IIUC, the underlying IOMMUs you refer to are system-wide. I think this idea may
have two issues. 1) The target domain may contain only part of the system's
IOMMUs. 2) The dirty tracking capability can depend on the capability of devices.
For example, we can track the dirty log based on IOPF, which requires device
support. That is to say, we can make this framework more generic.

> 
> Or, the more crazy idea is that we don't need to check this capability
> at all. If dirty bit tracking is not supported by hardware, just mark
> all pages dirty?
Yeah, I think this idea is nice :).

Still, one concern is that we may have other dirty tracking methods in the
future; if we can't track dirty pages through the IOMMU, we can still try
other methods.

If there is no interface to check this capability, we have no chance to try
other methods. What do you think?
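For reference, the -EBUSY/-EINVAL state checks in iommu_switch_dirty_log from the quoted patch can be modeled in a few lines of Python (a toy model of the quoted logic, not the kernel code itself):

```python
import errno

class Domain:
    def __init__(self):
        self.dirty_log_tracking = False

    def switch_dirty_log(self, enable):
        # mirrors the checks done under domain->switch_log_lock
        if enable and self.dirty_log_tracking:
            return -errno.EBUSY      # already tracking
        if not enable and not self.dirty_log_tracking:
            return -errno.EINVAL     # nothing to disable
        self.dirty_log_tracking = enable
        return 0

d = Domain()
assert d.switch_dirty_log(False) == -errno.EINVAL  # not tracking yet
assert d.switch_dirty_log(True) == 0
assert d.switch_dirty_log(True) == -errno.EBUSY    # double enable rejected
assert d.switch_dirty_log(False) == 0
```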

> 
>> +
>> +int iommu_switch_dirty_log(struct iommu_domain *domain, bool enable,
>> +   unsigned long iova, size_t size, int prot)
>> +{
>> +const struct iommu_ops *ops = domain->ops;
>> +unsigned long orig_iova = iova;
>> +unsigned int min_pagesz;
>> +size_t orig_size = size;
>> +bool flush = false;
>> +int ret = 0;
>> +
>> +if (unlikely(!ops->switch_dirty_log))
>> +return -ENODEV;
>> +
>> +min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
>> +if (!IS_ALIGNED(iova | size, min_pagesz)) {
>> +pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
>> +   iova, size, min_pagesz);
>> +return -EINVAL;
>> +}
>> +
>> +mutex_lock(&domain->switch_log_lock);
>> +if (enable && domain->dirty_log_tracking) {
>> +ret = -EBUSY;
>> +goto out;
>> +} else if (!enable && !domain->dirty_log_tracking) {
>> +ret = -EINVAL;
>> +goto out;
>> +}
>> +
>> +pr_debug("switch_dirty_log %s for: iova 0x%lx size 0x%zx\n",
>> + enable ? "enable" : "disable", iova, size);
>> +
>> +while (size) {
>> +size_t pgsize = iommu_pgsize(domain, iova, size);
>> +
>> +flush = true;
>> +ret = ops->switch_dirty_log(domain, enable, iova, pgsize, prot);
> 
> Per-minimum-page callbacks are very expensive. How about using (pagesize,
> count), so that all pages with the same page size could be handled in a
> single indirect call? I remember I commented

RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-08 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Saturday, May 8, 2021 1:06 AM
> 
> > > Those are the main ones I can think of.  It is nice to have a simple
> > > map/unmap interface, I'd hope that a new /dev/ioasid interface wouldn't
> > > raise the barrier to entry too high, but the user needs to have the
> > > ability to have more control of their mappings and locked page
> > > accounting should probably be offloaded somewhere.  Thanks,
> > >
> >
> > Based on your feedbacks I feel it's probably reasonable to start with
> > a type1v2 semantics for the new interface. Locked accounting could
> > also start with the same VFIO restriction and then improve it
> > incrementally, if a cleaner way is intrusive (if not affecting uAPI).
> > But I didn't get the suggestion on "more control of their mappings".
> > Can you elaborate?
> 
> Things like I note above, userspace cannot currently specify mapping
> granularity nor has any visibility to the granularity they get from the
> IOMMU.  What actually happens in the IOMMU is pretty opaque to the user
> currently.  Thanks,
> 

That's much clearer. Based on all the discussions so far, I'm thinking about
a staging approach for building the new interface, basically following
the model that Jason pointed out - generic stuff first, then platform-
specific extensions:

Phase 1: /dev/ioasid with core ingredients and vfio type1v2 semantics
- ioasid is the software handle representing an I/O page table
- uAPI accepts a type1v2 map/unmap semantics per ioasid
- helpers for VFIO/VDPA to bind ioasid_fd and attach ioasids
- multiple ioasids are allowed without nesting (vIOMMU, or devices
w/ incompatible iommu attributes)
- an ioasid disallows any operation before it's attached to a device
- an ioasid inherits iommu attributes from the 1st device attached
to it
- userspace is expected to manage hardware restrictions and the
kernel only returns error when restrictions are broken
* map/unmap on an ioasid will fail before every device in a group 
is attached to it
* ioasid attach will fail if the new device has incompatible iommu
attributes vs. those of this ioasid
- thus no group semantics in uAPI
- no change to vfio container/group/type1 logic, for running existing
vfio applications
* imply some duplication between vfio type1 and ioasid for some time
- new uAPI in vfio to allow explicit opening of a device and then binding
it to the ioasid_fd
* possibly require each device exposed in /dev/vfio/
- support both pdev and mdev
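The phase-1 attach/map rules above can be sketched as a toy Python model (illustrative only; class and attribute names are made up, not the real uAPI):

```python
# Toy model of the phase-1 rules: attribute inheritance from the 1st
# device, rejection of incompatible devices, and map failing until
# every device in each attached device's group is attached.
class Device:
    def __init__(self, iommu_attrs, group):
        self.iommu_attrs = iommu_attrs
        self.group = group            # shared list of sibling devices

class Ioasid:
    def __init__(self):
        self.attrs = None             # inherited from the 1st device
        self.devices = set()

    def attach(self, dev):
        if self.attrs is None:
            self.attrs = dev.iommu_attrs
        elif self.attrs != dev.iommu_attrs:
            raise ValueError("incompatible iommu attributes")
        self.devices.add(dev)

    def map(self, iova, size):
        for dev in self.devices:
            if not all(sib in self.devices for sib in dev.group):
                raise PermissionError("group not fully attached")
        return (iova, size)

group = []
d1, d2 = Device("attrs_a", group), Device("attrs_a", group)
group.extend([d1, d2])
ioasid = Ioasid()
ioasid.attach(d1)
try:
    ioasid.map(0x1000, 4096)          # d2 not attached yet -> error
    raise AssertionError("map should have failed")
except PermissionError:
    pass
ioasid.attach(d2)
assert ioasid.map(0x1000, 4096) == (0x1000, 4096)
```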

Phase 2: ioasid nesting
- Allow bind/unbind_pgtable semantics per ioasid
- Allow ioasid nesting 
* HW ioasid nesting if supported by platform
* otherwise fall back to SW ioasid nesting (in-kernel shadowing)
- iotlb invalidation per ioasid
- I/O page fault handling per ioasid
- hw_id is not exposed in uAPI. Vendor IOMMU driver decides
when/how hw_id is allocated and programmed properly
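The SW nesting fallback above amounts to composing the child (stage-1) table with the parent (stage-2) table in the kernel. A minimal sketch, with made-up mappings:

```python
# Toy model of SW ioasid nesting (in-kernel shadowing): the shadow
# table maps gIOVA directly to hPA by composing the two stages.
stage2 = {0x1000: 0x9000, 0x2000: 0xa000}   # gPA -> hPA (parent ioasid)
stage1 = {0x5000: 0x1000}                   # gIOVA -> gPA (child ioasid)

def shadow(stage1, stage2):
    return {giova: stage2[gpa] for giova, gpa in stage1.items()}

assert shadow(stage1, stage2) == {0x5000: 0x9000}
```

With HW nesting the IOMMU walks both stages itself; the shadow step is only needed on platforms without that capability.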

Phase3: optimizations and vendor extensions (order undefined, up to
the specific feature owner):
- (Intel) ENQCMD support with hw_id exposure in uAPI
- (ARM/AMD) RID-based pasid table assignment
- (PPC) window-based iova management
- Optimizations:
* replace vfio type1 with a shim driver to use ioasid backend
* mapping granularity
* HW dirty page tracking
* ...

Does the above sound like a sensible plan? If yes we'll start working on 
phase 1 then...

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-08 Thread Liu Yi L
Hi Jason,

On Wed, 5 May 2021 19:21:20 -0300, Jason Gunthorpe wrote:

> On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Wed, 5 May 2021 15:00:23 -0300, Jason Gunthorpe  wrote:
> >   
> > > On Wed, May 05, 2021 at 10:22:59AM -0700, Jacob Pan wrote:
> > >   
> > > > Global and pluggable are for slightly separate reasons.
> > > > - We need global PASID on VT-d in that we need to support shared
> > > > workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then
> > > > assigned to two VMs. Each VM uses its private guest PASID to submit
> > > > work but each guest PASID must be translated to a global (system-wide)
> > > > host PASID to avoid conflict. Also, since PASID table storage is per
> > > > PF, if two mdevs of the same PF are assigned to different VMs, the
> > > > PASIDs must be unique.
> > > 
> > > From a protocol perspective each RID has a unique PASID table, and
> > > RIDs can have overlapping PASIDs.
> > >   
> > True, per RID or per PF as I was referring to.
> >   
> > > Since your SWQ is connected to a single RID the requirement that
> > > PASIDs are unique to the RID ensures they are sufficiently unique.
> > >   
> > True, but one process can submit work to multiple mdevs from different
> > RIDs/PFs. One process uses one PASID and PASID translation table is per VM.
> > The same PASID is used for all the PASID tables of each RID.  
> 
> If the model is "assign this PASID to this RID" then yes, there is a
> big problem keeping everything straight that can only be solved with a
> global table.
> 
> But if the model is "give me a PASID for this RID" then it isn't such
> a problem.

Let me double confirm that I'm understanding you correctly. Your suggestion
is to have a per-RID PASID namespace, which can be maintained by the IOMMU
driver, right? Take native SVM usage as an example: every time a process is
bound to a device, a PASID within this RID will be allocated. Am I correct so far?

If yes, then there is a case in which IOTLB efficiency is really low. Let's say
there is a process bound to multiple devices (RIDs) with a different PASID
allocated for each RID. In such a case, the PASID values are different for each
RID. As most vendors do, the PASID will be used to tag IOTLB entries. So in such
a case, there will be multiple IOTLB entries for a single VA->PA mapping, and the
number of such duplicate IOTLB entries increases linearly with the number of
devices. Seems not good from a performance perspective.
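The duplication can be illustrated with a toy model (assuming the IOTLB tags entries by (PASID, VA); this is a simplification for illustration):

```python
# Toy IOTLB model: one process, one VA->PA mapping, N devices.
def iotlb_entries(rids, pasid_for_rid, va):
    return {(pasid_for_rid[rid], va) for rid in rids}

rids = ["dev0", "dev1", "dev2"]
per_rid = {"dev0": 1, "dev1": 7, "dev2": 3}   # different PASID per RID
global_ = {rid: 5 for rid in rids}            # one global PASID

assert len(iotlb_entries(rids, per_rid, 0x1000)) == 3   # grows with devices
assert len(iotlb_entries(rids, global_, 0x1000)) == 1   # entries shareable
```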

> 
> Basically trying to enforce a uniform PASID for an IOASID across all
> RIDs attached to it is not such a nice choice.
> 
> > > That is fine, but all this stuff should be inside the Intel vIOMMU
> > > driver not made into a global resource of the entire iommu subsystem.
> > >   
> > Intel vIOMMU has to use a generic uAPI to allocate PASID so the generic
> > code need to have this option. I guess you are saying we should also have a
> > per RID allocation option in addition to global?  
> 
> There always has to be a RID involvement for the PASID, for security,
> this issue really boils down to where the PASID lives.
> 
> If you need the PASID attached to the IOASID then it has to be global
> because the IOASID can be attached to any RID and must keep the same
> PASID.
> 
> If the PASID is learned when the IOASID is attached to a RID then the
> PASID is more flexible and isn't attached to the IOASID.
> 
> Honestly I'm a little leary to bake into a UAPI a specific HW choice
> that Intel made here.
> 
> I would advise making the "attach a global PASID to this IOASID"
> operation explicit and opt into for case that actually need it.
> 
> Which implies the API to the iommu driver should be more like:
> 
>   'assign an IOASID to this RID and return the PASID'
>   'reserve a PASID from every RID'
>   'assign an IOASID to this RID and use this specific PASID'
> 
> In all cases the scope of those operations are completely local to a
> certain IOMMU driver - 'reserver a PASID from every RID' is really
> every RID that driver can operate on.

Also, this reservation will fail if the PASID happens to be occupied by a
previous usage. As the PASID translation table is per-VM, ENQCMD in the VM
will be a problem under such a PASID management model.
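The three driver-local operations Jason lists, and the failure mode I mention above, can be sketched as follows (hypothetical names; the scope is a single IOMMU driver, not a global allocator):

```python
# Toy per-driver PASID allocator implementing the three operations.
class IommuDriver:
    def __init__(self, rids, max_pasid=16):
        self.free = {rid: set(range(1, max_pasid)) for rid in rids}

    def assign_ioasid(self, rid):
        # 'assign an IOASID to this RID and return the PASID'
        pasid = min(self.free[rid])
        self.free[rid].remove(pasid)
        return pasid

    def reserve_everywhere(self, pasid):
        # 'reserve a PASID from every RID' - fails if taken on any RID
        if any(pasid not in s for s in self.free.values()):
            return False
        for s in self.free.values():
            s.remove(pasid)
        return True

    def assign_specific(self, rid, pasid):
        # 'assign an IOASID to this RID and use this specific PASID'
        self.free[rid].remove(pasid)   # raises KeyError if occupied
        return pasid

drv = IommuDriver(["rid0", "rid1"])
p = drv.assign_ioasid("rid0")
assert p == 1
assert drv.reserve_everywhere(p) is False   # occupied on rid0 -> fails
assert drv.reserve_everywhere(2) is True
assert drv.assign_specific("rid1", 3) == 3
```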

> 
> So it is hard to see why the allocator should be a global resource and
> not something that is part of the iommu driver exclusively.
> 
> Jason

-- 
Regards,
Yi Liu