RE: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage
Hi Baolu,

> -----Original Message-----
> From: Lu Baolu [mailto:baolu...@linux.intel.com]
> Sent: Friday, April 2, 2021 12:44 PM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.);
> iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org
> Cc: baolu...@linux.intel.com; David Woodhouse; Nadav Amit;
> Alex Williamson; Kevin Tian; Gonglei (Arei); sta...@vger.kernel.org
> Subject: Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage
>
> Hi Longpeng,
>
> On 4/1/21 3:18 PM, Longpeng(Mike) wrote:
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index ee09323..cbcb434 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain,
> >  * removed to make room for superpage(s).
> >  * We're adding new large pages, so make sure
> >  * we don't remove their parent tables.
> > + *
> > + * We also need to flush the iotlb before creating
> > + * superpage to ensure it does not perserves any
> > + * obsolete info.
> >  */
> > - dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
> > -                        largepage_lvl + 1);
> > + if (dma_pte_present(pte)) {
>
> The dma_pte_free_pagetable() clears a batch of PTEs. So checking the
> current PTE is insufficient. How about removing this check and always
> performing cache invalidation?

Um... the PTE here may be present (e.g. a 4K mapping --> superpage
mapping) or NOT present (e.g. creating a totally new superpage mapping),
but we only need to call free_pagetable and flush_iotlb in the former
case, right?
> > + int i;
> > +
> > + dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
> > +                        largepage_lvl + 1);
> > + for_each_domain_iommu(i, domain)
> > +         iommu_flush_iotlb_psi(g_iommus[i], domain,
> > +                               iov_pfn, nr_pages, 0, 0);
> > +
>
> Best regards,
> baolu

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RESEND] lib/scatterlist: Fix NULL pointer deference
Looks good,

Reviewed-by: Christoph Hellwig
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Jason Gunthorpe
> Sent: Tuesday, April 6, 2021 8:21 PM
>
> On Tue, Apr 06, 2021 at 01:02:05AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Tuesday, April 6, 2021 7:40 AM
> > >
> > > On Fri, Apr 02, 2021 at 07:58:02AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe
> > > > > Sent: Thursday, April 1, 2021 9:47 PM
> > > > >
> > > > > On Thu, Apr 01, 2021 at 01:43:36PM +0000, Liu, Yi L wrote:
> > > > > > > From: Jason Gunthorpe
> > > > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > > > >
> > > > > > > On Thu, Apr 01, 2021 at 01:10:48PM +0000, Liu, Yi L wrote:
> > > > > > > > > From: Jason Gunthorpe
> > > > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > > > [...]
> > > > > > > > > I'm worried Intel views the only use of PASID in a
> > > > > > > > > guest is with ENQCMD, but that is not consistent with
> > > > > > > > > the industry. We need to see normal nested PASID
> > > > > > > > > support with assigned PCI VFs.
> > > > > > > >
> > > > > > > > I'm not quite following here. Intel also allows PASID
> > > > > > > > usage in a guest without ENQCMD, e.g. pass through a PF
> > > > > > > > to the guest and use PASID on it without ENQCMD.
> > > > > > >
> > > > > > > Then you need all the parts, the hypervisor calls from the
> > > > > > > vIOMMU, and you can't really use a vPASID.
> > > > > >
> > > > > > This is a diagram showing the vSVA setup.
> > > > >
> > > > > I'm not talking only about vSVA. Generic PASID support with
> > > > > arbitrary mappings.
> > > > >
> > > > > And how do you deal with the vPASID vs pPASID issue if the
> > > > > system has a mix of physical devices and mdevs?
> > > >
> > > > We plan to support two schemes. One is vPASID identity-mapped to
> > > > pPASID, where the mixed scenario just works, with the limitation
> > > > of lacking live migration support.
> > > > The other is a non-identity-mapped scheme, where live migration
> > > > is supported but physical devices and mdevs should not be mixed
> > > > in one VM if both expose SVA capability (requires some filtering
> > > > check in Qemu).
> > >
> > > That just becomes "block vPASID support if any device that doesn't
> > > use ENQCMD is plugged into the guest"
> >
> > The limitation is only for physical devices, and in reality it is
> > not that bad. To support live migration with a physical device we
> > anyway need additional work to migrate the device state (e.g. based
> > on Max's work), then it's not unreasonable to also mediate guest
> > programming of the device-specific PASID register to enable vPASID
> > (we need to translate over the whole VM lifespan but likely it is
> > not a hot path).
>
> IMHO that is pretty unreasonable.. More likely we end up with vPASID
> tables in each migratable device like KVM has.

Just like an mdev needs to maintain an allowed PASID list, this extends
it to all migratable devices.

> > > Which needs a special VFIO capability of some kind so qemu knows
> > > to block it. This really needs to all be laid out together so
> > > someone can understand it :(
> >
> > Or it could simply be based on whether the VFIO device supports live
> > migration.
>
> You need to define affirmative caps that indicate that vPASID will be
> supported by the VFIO device.

Yes, this is required, as I acked in another mail.

> > > Why doesn't the siov cookbook explain this stuff??
> > >
> > > > We hope /dev/ioasid can support both schemes, with the minimal
> > > > requirement of allowing userspace to tag a vPASID to a pPASID
> > > > and allowing mdev to translate vPASID into pPASID, i.e. not
> > > > assuming that the guest will always use pPASID.
> > >
> > > What I'm unclear on is if /dev/ioasid even needs to care about
> > > vPASID or if vPASID is just a hidden artifact of the KVM
> > > connection to setup the translation table and the vIOMMU driver
> > > in qemu.
> >
> > Not just for KVM.
> > Also required by mdev, which needs to translate vPASID into pPASID
> > when ENQCMD is not used.
>
> Do we have any mdev's that will do this?

Definitely. Actually any mdev which doesn't do ENQCMD needs to do this.
In the normal case, the PASID is programmed into an MMIO register (or
in-memory context) associated with the backend resource of the mdev.
The value programmed from the guest is a vPASID, thus it must be
translated into a pPASID before updating the physical register.

> > should only care about the operations related to pPASID. VFIO could
> > carry vPASID information to mdev.
>
> It depends how common this is, I suppose

Based on the above, I think it's a common case.

Thanks
Kevin
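The vPASID-to-pPASID translation Kevin describes — the mediating driver intercepts the guest's write of a vPASID and substitutes the host-allocated pPASID before touching hardware — can be sketched in userspace C. All names and the table layout below are hypothetical, invented purely for illustration:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical per-mdev table pairing guest-visible vPASIDs with the
 * pPASIDs the host allocated for them (not a kernel structure).
 */
struct pasid_xlate {
	unsigned int vpasid;
	unsigned int ppasid;
};

/*
 * Translate a guest-programmed vPASID into a pPASID. Returns -1 when
 * the guest was never granted this vPASID; the mediating driver must
 * reject the write rather than pass the value through, or isolation
 * between guests is broken.
 */
static int vpasid_to_ppasid(const struct pasid_xlate *tbl, size_t n,
			    unsigned int vpasid)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].vpasid == vpasid)
			return (int)tbl[i].ppasid;
	return -1;
}
```

An mdev's MMIO write handler would run a lookup like this before updating the physical PASID register; under the identity-mapped scheme discussed above, the table degenerates to vpasid == ppasid.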
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Jason Gunthorpe
> Sent: Tuesday, April 6, 2021 8:35 PM
>
> On Tue, Apr 06, 2021 at 01:27:15AM +0000, Tian, Kevin wrote:
> >
> > and here is one example why using the existing VFIO/VDPA interface
> > makes sense. Say dev1 (w/ sva) and dev2 (w/o sva) are placed in a
> > single VFIO container.
>
> Forget about SVA, it is an irrelevant detail of how a PASID is
> configured.
>
> > The container is associated to an iommu domain which contains a
> > single 2nd-level page table, shared by both devices (when attached
> > to the domain).
>
> This level should be described by an ioasid.
>
> > The VFIO MAP operation is applied to the 2nd-level page table thus
> > naturally applied to both devices. Then userspace could use
> > /dev/ioasid to further allocate IOASIDs and bind multiple 1st-level
> > page tables for dev1, nested on the shared 2nd-level page table.
>
> Because if you don't then we enter an insane world where a PASID is
> being created under /dev/ioasid but its translation path flows through
> setup done by VFIO and the whole user API becomes an incomprehensible
> mess.
>
> How will you even associate the PASID with the other translation??

The PASID is attached to a specific iommu domain (created by VFIO/VDPA),
which has GPA->HPA mappings already configured. If we view that mapping
as an attribute of the iommu domain, it's reasonable to have the
userspace-bound pgtable through /dev/ioasid nest on it.

> The entire translation path for any ioasid or PASID should be defined
> only by /dev/ioasid. Everything else is a legacy API.
>
> > If following your suggestion then VFIO must deny VFIO MAP operations
> > on sva1 (assume userspace should not mix sva1 and sva2 in the same
> > container and instead use /dev/ioasid to map for sva1)?
> No, userspace creates an ioasid for the guest physical mapping and
> passes this ioasid to VFIO PCI which will assign it as the first layer
> mapping on the RID

Is it a dummy ioasid just for providing GPA mappings for the nesting
purpose of other IOASIDs? Then we waste one per VM?

> When PASIDs are allocated the uAPI will be told to logically nest
> under the first ioasid. When VFIO authorizes a PASID for a RID it
> checks that all the HW rules are being followed.

As I explained above, why can't we just use the iommu domain to connect
the dots? Every passthrough framework needs to create an iommu domain
first, and it needs to support both devices w/ PASID and devices w/o
PASID. For devices w/o PASID it needs to invent its own MAP interface
anyway. Then why do we bother creating another MAP interface through
/dev/ioasid, which not only duplicates it but also creates a transition
burden between two sets of MAP interfaces when the guest turns on/off
the PASID capability on the device?

> If there are rules like groups of VFIO devices must always use the
> same IOASID then VFIO will check these too (and realistically qemu
> will have only one guest physical map ioasid anyhow)
>
> There is no real difference between setting up an IOMMU table for a
> (RID,PASID) tuple or just a RID. We can do it universally with
> one interface for all consumers.

'Universally' depends on which angle you look at this problem from.
From the IOASID p.o.v. possibly yes, but from the device passthrough
p.o.v. it's the opposite, since the passthrough framework needs to
handle devices w/o PASID anyway (or even a device w/ PASID could send
traffic w/o PASID). Thus 'universally' makes more sense if the
passthrough framework can use one interface of its own to manage GPA
mappings for all consumers (applied to the case when a PASID is
allowed/authorized).

Thanks
Kevin
Re: [PATCH v8 7/9] vfio/mdev: Add iommu related member in mdev_device
Hi Jason,

On 4/7/21 4:00 AM, Jason Gunthorpe wrote:
> On Mon, Mar 25, 2019 at 09:30:34AM +0800, Lu Baolu wrote:
> > A parent device might create different types of mediated devices.
> > For example, a mediated device could be created by the parent device
> > with full isolation and protection provided by the IOMMU. One usage
> > case could be found on Intel platforms where a mediated device is an
> > assignable subset of a PCI, the DMA requests on behalf of it are all
> > tagged with a PASID. Since IOMMU supports PASID-granular
> > translations (scalable mode in VT-d 3.0), this mediated device could
> > be individually protected and isolated by an IOMMU.
> >
> > This patch adds a new member in the struct mdev_device to indicate
> > that the mediated device represented by mdev could be isolated and
> > protected by attaching a domain to a device represented by
> > mdev->iommu_device. It also adds a helper to add or set the iommu
> > device.
> >
> > * mdev_device->iommu_device
> >   - This, if set, indicates that the mediated device could be fully
> >     isolated and protected by IOMMU via attaching an iommu domain to
> >     this device. If empty, it indicates using vendor defined
> >     isolation, hence bypass IOMMU.
> >
> > * mdev_set/get_iommu_device(dev, iommu_device)
> >   - Set or get the iommu device which represents this mdev in
> >     IOMMU's device scope. Drivers don't need to set the iommu device
> >     if it uses vendor defined isolation.
> > Cc: Ashok Raj
> > Cc: Jacob Pan
> > Cc: Kevin Tian
> > Cc: Liu Yi L
> > Suggested-by: Kevin Tian
> > Suggested-by: Alex Williamson
> > Signed-off-by: Lu Baolu
> > Reviewed-by: Jean-Philippe Brucker
> > ---
> >  drivers/vfio/mdev/mdev_core.c    | 18 ++
> >  drivers/vfio/mdev/mdev_private.h |  1 +
> >  include/linux/mdev.h             | 14 ++
> >  3 files changed, 33 insertions(+)
> >
> > diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> > index b96fedc77ee5..1b6435529166 100644
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -390,6 +390,24 @@ int mdev_device_remove(struct device *dev, bool force_remove)
> >  return 0;
> > }
> >
> > +int mdev_set_iommu_device(struct device *dev, struct device *iommu_device)
> > +{
> > + struct mdev_device *mdev = to_mdev_device(dev);
> > +
> > + mdev->iommu_device = iommu_device;
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(mdev_set_iommu_device);
>
> I was looking at these functions when touching the mdev stuff and I
> have some concerns.
>
> 1) Please don't merge dead code. It is a year later and there is still
> no in-tree user for any of this. This is not our process. Even worse
> it was exported so it looks like this dead code is supporting out of
> tree modules.
>
> 2) Why is this like this? Every struct device already has a connection
> to the iommu layer and every mdev has a struct device all its own. Why
> did we need to add special 'if (mdev)' stuff all over the place? This
> smells like the same abuse Thomas and I pointed out for the interrupt
> domains.

I once tried to implement bus iommu_ops for mdev devices:

https://lore.kernel.org/lkml/20201030045809.957927-1-baolu...@linux.intel.com/

Any comments?

Best regards,
baolu

> After my next series the mdev drivers will have direct access to the
> vfio_device. So an alternative to using the struct device, or adding
> 'if mdev', is to add an API to the vfio_device world to inject what
> iommu configuration is needed from that direction instead of trying to
> discover it from a struct device.
> 3) The vfio_bus_is_mdev() and related symbol_get() nonsense in
> drivers/vfio/vfio_iommu_type1.c has to go, for the same reasons it
> was not acceptable to do this for the interrupt side either.
>
> 4) It seems pretty clear to me this will be heavily impacted by the
> /dev/ioasid discussion. Please consider removing the dead code now.
>
> Basically, please fix this before trying to get idxd mdev merged as
> the first user.
>
> Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On 2021/4/6 8:42 PM, Jason Gunthorpe wrote:
> On Tue, Apr 06, 2021 at 09:35:17AM +0800, Jason Wang wrote:
> > > VFIO and VDPA have no business having map/unmap interfaces once we
> > > have /dev/ioasid. That all belongs on the ioasid side.
> > >
> > > I know they have those interfaces today, but that doesn't mean we
> > > have to keep using them for PASID use cases, they should be
> > > replaced with a 'do dma from this pasid on /dev/ioasid' interface,
> > > certainly not a 'here is a pasid from /dev/ioasid, go ahead and
> > > configure it yourself' interface
> >
> > So it looks like the PASID was bound to SVA in this design. I think
> > it's not necessarily the case:
>
> No, I wish people would stop talking about SVA. SVA and vSVA are a
> very special narrow configuration of a PASID. There are lots of other
> PASID configurations! That is the whole point, a PASID is complicated,
> there are many configuration scenarios, they need to be in one place
> with a very clearly defined uAPI

Right, that's my understanding as well.

> > 1) PASID can be implemented without SVA, in this case a map/unmap
> > interface is still required
>
> Any interface to manipulate a PASID should be under /dev/ioasid. We do
> not want to duplicate this into every subsystem.

Yes.

> > 2) For the case that the hypervisor wants to do some mediation in
> > the middle for a virtqueue, e.g. in the case of a control vq that is
> > implemented in the VF/ADI/SF itself, the hardware virtqueue needs to
> > be controlled by Qemu. Though binding qemu's page table to the cvq
> > can work, it looks like overkill; a small dedicated buffer that is
> > mapped for this PASID seems more suitable.
>
> /dev/ioasid should allow userspace to setup any PASID configuration it
> wants. There are many choices. That is the whole point: instead of
> copying & pasting all the PASID configuration options into every
> subsystem, we have one place to configure it.
>
> If you want a PASID (or generic ioasid) that has the guest physical
> map, which is probably all that VDPA would ever want, then /dev/ioasid
> should be able to prepare that.
> If you just want to map a few buffers into a PASID then it should be
> able to do that too.

So do you mean the device should not expose the PASID configuration API
to the guest? I think it could happen if we assign the whole device and
let the guest configure it for nested VMs.

> This always needs co-operating with the vIOMMU driver. We can't have
> nested PASID use without both parts working together.
>
> The vIOMMU driver configures the PASID and assigns the mappings
> (however complicated that turns out to be)
>
> The VDPA/mdev driver authorizes the HW to use the ioasid mapping, eg
> by authorizing a queue to issue PCIe TLPs with a specific PASID.
>
> The authorization is triggered by the guest telling the vIOMMU to
> allow a vRID to talk to a PASID, which qemu would have to translate to
> telling something like the VDPA driver under the vRID that it can use
> a PASID from /dev/ioasid
>
> For security a VDPA/mdev device MUST NOT issue PASIDs that the vIOMMU
> has not authorized its vRID to use. Otherwise the security model of
> something like VFIO in the guest becomes completely broken.

Yes, that's how it should work.

Thanks

> Jason
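The security rule stated above — a VDPA/mdev backend may only issue PASIDs that the vIOMMU has authorized its vRID to use — boils down to a lookup against a table of approved (vRID, PASID) pairs before any queue is attached. A minimal userspace sketch (the structure and function names are hypothetical, for illustration only):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical list of (vRID, PASID) pairs the vIOMMU has authorized;
 * invented for illustration, not a real kernel or QEMU structure.
 */
struct rid_pasid_auth {
	unsigned int vrid;
	unsigned int pasid;
};

/*
 * Gate every attempt to attach a queue to a PASID: only pairs the
 * vIOMMU explicitly authorized may reach hardware. Anything else must
 * be refused, or the guest's VFIO security model breaks.
 */
static int pasid_authorized(const struct rid_pasid_auth *auth, size_t n,
			    unsigned int vrid, unsigned int pasid)
{
	for (size_t i = 0; i < n; i++)
		if (auth[i].vrid == vrid && auth[i].pasid == pasid)
			return 1;
	return 0;
}
```

In the flow described above, entries would be added when the guest tells the vIOMMU to allow a vRID to talk to a PASID, and consulted by the VDPA/mdev driver before programming a queue.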
Re: [PATCH v5.4 1/1] iommu/vt-d: Fix agaw for a supported 48 bit guest address width
Hi Saeed,

On 4/7/21 12:35 AM, Saeed Mirzamohammadi wrote:
> The IOMMU driver calculates the guest addressability for a DMA request
> based on the value of the mgaw reported from the IOMMU. However, this
> is a fused value and, as mentioned in the spec, the guest width should
> be calculated based on the supported adjusted guest address width
> (SAGAW).
>
> This is from the specification: "Guest addressability for a given DMA
> request is limited to the minimum of the value reported through this
> field and the adjusted guest address width of the corresponding
> page-table structure. (Adjusted guest address widths supported by
> hardware are reported through the SAGAW field)."
>
> This causes domain initialization to fail and the following errors
> appear for the EHCI PCI driver:
>
> [2.486393] ehci-pci :01:00.4: EHCI Host Controller
> [2.486624] ehci-pci :01:00.4: new USB bus registered, assigned bus number 1
> [2.489127] ehci-pci :01:00.4: DMAR: Allocating domain failed
> [2.489350] ehci-pci :01:00.4: DMAR: 32bit DMA uses non-identity mapping
> [2.489359] ehci-pci :01:00.4: can't setup: -12
> [2.489531] ehci-pci :01:00.4: USB bus 1 deregistered
> [2.490023] ehci-pci :01:00.4: init :01:00.4 fail, -12
> [2.490358] ehci-pci: probe of :01:00.4 failed with error -12
>
> This issue happens when the value of the sagaw corresponds to a 48-bit
> agaw. This fix updates the calculation of the agaw based on the
> IOMMU's sagaw value.
> Signed-off-by: Saeed Mirzamohammadi
> Tested-by: Camille Lu
> ---
>  drivers/iommu/intel-iommu.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 953d86ca6d2b..396e14fad54b 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1867,8 +1867,8 @@ static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu,
>  domain_reserve_special_ranges(domain);
>
>  /* calculate AGAW */
> - if (guest_width > cap_mgaw(iommu->cap))
> -         guest_width = cap_mgaw(iommu->cap);
> + if (guest_width > agaw_to_width(iommu->agaw))
> +         guest_width = agaw_to_width(iommu->agaw);

The spec requires using the minimum of MGAW and AGAW, so why not:

        cap_width = min_t(int, cap_mgaw(iommu->cap), agaw_to_width(iommu->agaw));
        if (guest_width > cap_width)
                guest_width = cap_width;

Best regards,
baolu

>  domain->gaw = guest_width;
>  adjust_width = guestwidth_to_adjustwidth(guest_width);
>  agaw = width_to_agaw(adjust_width);
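The min-of-MGAW-and-AGAW clamping suggested in the reply above can be checked with a standalone sketch. `agaw_to_width()` here mirrors the driver's 30 + 9 × agaw encoding (capped at 64 bits); the mgaw/agaw arguments are example values, not real capability-register reads:

```c
#include <assert.h>

/*
 * Mirrors the driver's agaw_to_width() encoding: a 2-level table
 * covers 30 bits and each extra level adds 9, capped at 64 bits.
 */
static int agaw_to_width(int agaw)
{
	int width = 30 + agaw * 9;

	return width < 64 ? width : 64;
}

/*
 * Clamp the requested guest width to min(MGAW, width implied by the
 * supported AGAW), as the min_t() suggestion does.
 */
static int clamp_guest_width(int guest_width, int mgaw, int agaw)
{
	int agaw_width = agaw_to_width(agaw);
	int cap_width = mgaw < agaw_width ? mgaw : agaw_width;

	return guest_width > cap_width ? cap_width : guest_width;
}
```

With a fused MGAW of 57 but SAGAW supporting only the 48-bit (4-level) AGAW, a 57-bit request now clamps to 48 rather than 57, which is exactly the failing-EHCI scenario the patch describes.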
Re: [RFC PATCH v3 03/12] iommu/io-pgtable: Introduce map_pages() as a page table op
On 2021-04-06 04:57, Will Deacon wrote:
> On Mon, Apr 05, 2021 at 12:11:03PM -0700, Isaac J. Manjarres wrote:
> > Mapping memory into io-pgtables follows the same semantics that
> > unmapping memory used to follow (i.e. a buffer will be mapped one
> > page block per call to the io-pgtable code). This means that it can
> > be optimized in the same way that unmapping memory was, so add a
> > map_pages() callback to the io-pgtable ops structure, so that a
> > range of pages of the same size can be mapped within the same call.
> >
> > Signed-off-by: Isaac J. Manjarres
> > Suggested-by: Will Deacon
> > ---
> >  include/linux/io-pgtable.h | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> > index 2ed0c057d9e7..019149b204b8 100644
> > --- a/include/linux/io-pgtable.h
> > +++ b/include/linux/io-pgtable.h
> > @@ -143,6 +143,7 @@ struct io_pgtable_cfg {
> >   * struct io_pgtable_ops - Page table manipulation API for IOMMU drivers.
> >   *
> >   * @map: Map a physically contiguous memory region.
> > + * @map_pages: Map a physically contiguous range of pages of the same size.
> >   * @unmap: Unmap a physically contiguous memory region.
> >   * @unmap_pages: Unmap a range of virtually contiguous pages of the same size.
> >   * @iova_to_phys: Translate iova to physical address.
> > @@ -153,6 +154,9 @@ struct io_pgtable_cfg {
> >  struct io_pgtable_ops {
> >  int (*map)(struct io_pgtable_ops *ops, unsigned long iova,
> >     phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
> > + int (*map_pages)(struct io_pgtable_ops *ops, unsigned long iova,
> > +  phys_addr_t paddr, size_t pgsize, size_t pgcount,
> > +  int prot, gfp_t gfp, size_t *mapped);
>
> How about returning 'size_t' and using IS_ERR_VALUE() instead of
> adding the extra 'mapped' argument (i.e. return the size of the region
> mapped or an error code)? I don't think we realistically need to care
> about map sizes that overlap with the error region.
I'd given that a shot before, but the problem that I kept running into
was that in case of an error, if I return an error code, I don't know
how much memory was mapped, so that I can invoke iommu_unmap from
__iommu_map with that size to undo the partial mappings from a
map_pages() call. Returning the amount of memory that was mapped in the
case of an error will be less than the size that was requested, but
then we lose the information about why the error happened, since the
error code won't be returned, so that's why I went with the current
approach.

Do you have any other ideas about how to handle this? I'd thought of
having arm_lpae_map_pages() invoke arm_lpae_unmap_pages(), but the TLB
maintenance was a problem, as we wouldn't invoke iommu_iotlb_sync().

Thanks,
Isaac

> Will

___
linux-arm-kernel mailing list
linux-arm-ker...@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
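The trade-off Isaac describes — a plain error return loses the partially-mapped size, while a size return loses the error code — is why the proposed map_pages() reports progress through the `*mapped` out-parameter while still returning an error code. A toy model of that calling convention (this is an illustration, not the real io-pgtable code; the failure point is injected artificially):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy map routine: maps pgcount pages one at a time, with 'fail_at'
 * emulating an allocation failure partway through. On failure the
 * error code is returned AND *mapped says how far we got.
 */
static int toy_map_pages(size_t pgcount, size_t fail_at, size_t *mapped)
{
	*mapped = 0;
	for (size_t i = 0; i < pgcount; i++) {
		if (i == fail_at)
			return -12; /* -ENOMEM: *mapped pages must be unwound */
		(*mapped)++;
	}
	return 0;
}
```

The caller gets both pieces of information at once: `if (ret) unmap(iova, *mapped);` undoes the partial mapping while `ret` still says why the map failed — something a single `size_t` return with IS_ERR_VALUE() cannot express.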
Re: [RFC PATCH v3 09/12] iommu/io-pgtable-arm: Implement arm_lpae_unmap_pages()
On 2021-04-06 05:15, Will Deacon wrote:
> On Mon, Apr 05, 2021 at 12:11:09PM -0700, Isaac J. Manjarres wrote:
> > Implement the unmap_pages() callback for the ARM LPAE io-pgtable
> > format.
> >
> > Signed-off-by: Isaac J. Manjarres
> > Suggested-by: Will Deacon
> > ---
> >  drivers/iommu/io-pgtable-arm.c | 124 +++++++++++++++++++++++++++------
> >  1 file changed, 104 insertions(+), 20 deletions(-)
>
> [...]
>
> > +static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
> > +                                   size_t pgsize, size_t pgcount,
> > +                                   struct iommu_iotlb_gather *gather)
> > +{
> > + struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> > + struct io_pgtable_cfg *cfg = &data->iop.cfg;
> > + arm_lpae_iopte *ptep = data->pgd;
> > + long iaext = (s64)iova >> cfg->ias;
> > + size_t unmapped = 0, unmapped_page;
> > + int last_lvl;
> > + size_t table_size, pages, tbl_offset, max_entries;
> > +
> > + if (WARN_ON(!pgsize || (pgsize & cfg->pgsize_bitmap) != pgsize || !pgcount))
> > +         return 0;
> > +
> > + if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
> > +         iaext = ~iaext;
> > + if (WARN_ON(iaext))
> > +         return 0;
> > +
> > + /*
> > +  * Calculating the page table size here helps avoid situations where
> > +  * a page range that is being unmapped may be mapped at the same level
> > +  * but not mapped by the same tables. Allowing such a scenario to
> > +  * occur can complicate the logic in arm_lpae_split_blk_unmap().
> > +  */
> > + last_lvl = ARM_LPAE_BLOCK_SIZE_LVL(pgsize, data);
> > +
> > + if (last_lvl == data->start_level)
> > +         table_size = ARM_LPAE_PGD_SIZE(data);
> > + else
> > +         table_size = ARM_LPAE_GRANULE(data);
> > +
> > + max_entries = table_size / sizeof(*ptep);
> > +
> > + while (pgcount) {
> > +         tbl_offset = ARM_LPAE_LVL_IDX(iova, last_lvl, data);
> > +         pages = min_t(size_t, pgcount, max_entries - tbl_offset);
> > +         unmapped_page = __arm_lpae_unmap(data, gather, iova, pgsize,
> > +                                          pages, data->start_level,
> > +                                          ptep);
> > +         if (!unmapped_page)
> > +                 break;
> > +
> > +         unmapped += unmapped_page;
> > +         iova += unmapped_page;
> > +         pgcount -= pages;
> > + }
>
> Robin has some comments on the first version of this patch, and I
> don't think you addressed them:
>
> https://lore.kernel.org/r/b93fa0b1-e2a4-1aad-8b88-4d0dfecdf...@arm.com
>
> I'm inclined to agree that iterating here doesn't make a lot of
> sense -- we've already come back out of the page-table walk, so I
> think we should just return to the caller (who is well prepared to
> handle a partial unmap). Same for the map side of things.
>
> If we get numbers showing that this is causing a performance issue,
> then we should rework the page-table code to handle this at the lower
> level (because I doubt the loop that you have is really worse than
> returning to the caller anyway).

Sorry, I seem to have overlooked those comments. I will go ahead and
address them. I think it might be ideal to try to do as much work as
possible at the io-pgtable level, so as to minimize the number of
indirect calls incurred by jumping back and forth between the iommu
framework, iommu driver, and io-pgtable code.

Perhaps we can try something like how the linear mappings are created
on arm64, i.e. on the previous level we can determine how many pages
can be unmapped in one page table in one iteration, and on subsequent
iterations we can tackle another page table at the lower level. Looking
at the code, it doesn't seem too difficult to add this in. Thoughts?
Thanks,
Isaac

> Will
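The per-page-table batching in the loop under discussion — unmap at most the entries remaining in the current table, then continue at entry 0 of the next one — can be modeled standalone. This is an illustration of the arithmetic only (a real 4K-granule LPAE table holds 512 entries; the function name is invented):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count how many passes the batching loop needs: each pass covers at
 * most the entries left in the current page table (max_entries minus
 * the starting offset), and subsequent tables are walked from entry 0.
 */
static int unmap_iterations(size_t start_idx, size_t max_entries,
			    size_t pgcount)
{
	int iters = 0;
	size_t idx = start_idx % max_entries;

	while (pgcount) {
		size_t room = max_entries - idx;
		size_t pages = pgcount < room ? pgcount : room;

		pgcount -= pages;
		idx = 0; /* subsequent tables start at entry 0 */
		iters++;
	}
	return iters;
}
```

For example, unmapping 4 pages starting at entry 510 of a 512-entry table straddles two tables, so it takes two passes; this is the situation the `min_t(size_t, pgcount, max_entries - tbl_offset)` clamp in the patch handles.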
Re: [PATCH v8 7/9] vfio/mdev: Add iommu related member in mdev_device
On Mon, Mar 25, 2019 at 09:30:34AM +0800, Lu Baolu wrote:
> A parent device might create different types of mediated devices. For
> example, a mediated device could be created by the parent device with
> full isolation and protection provided by the IOMMU. One usage case
> could be found on Intel platforms where a mediated device is an
> assignable subset of a PCI, the DMA requests on behalf of it are all
> tagged with a PASID. Since IOMMU supports PASID-granular translations
> (scalable mode in VT-d 3.0), this mediated device could be
> individually protected and isolated by an IOMMU.
>
> This patch adds a new member in the struct mdev_device to indicate
> that the mediated device represented by mdev could be isolated and
> protected by attaching a domain to a device represented by
> mdev->iommu_device. It also adds a helper to add or set the iommu
> device.
>
> * mdev_device->iommu_device
>   - This, if set, indicates that the mediated device could be fully
>     isolated and protected by IOMMU via attaching an iommu domain to
>     this device. If empty, it indicates using vendor defined
>     isolation, hence bypass IOMMU.
>
> * mdev_set/get_iommu_device(dev, iommu_device)
>   - Set or get the iommu device which represents this mdev in IOMMU's
>     device scope. Drivers don't need to set the iommu device if it
>     uses vendor defined isolation.
> Cc: Ashok Raj
> Cc: Jacob Pan
> Cc: Kevin Tian
> Cc: Liu Yi L
> Suggested-by: Kevin Tian
> Suggested-by: Alex Williamson
> Signed-off-by: Lu Baolu
> Reviewed-by: Jean-Philippe Brucker
> ---
>  drivers/vfio/mdev/mdev_core.c    | 18 ++
>  drivers/vfio/mdev/mdev_private.h |  1 +
>  include/linux/mdev.h             | 14 ++
>  3 files changed, 33 insertions(+)
>
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index b96fedc77ee5..1b6435529166 100644
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -390,6 +390,24 @@ int mdev_device_remove(struct device *dev, bool force_remove)
>  return 0;
> }
>
> +int mdev_set_iommu_device(struct device *dev, struct device *iommu_device)
> +{
> + struct mdev_device *mdev = to_mdev_device(dev);
> +
> + mdev->iommu_device = iommu_device;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(mdev_set_iommu_device);

I was looking at these functions when touching the mdev stuff and I
have some concerns.

1) Please don't merge dead code. It is a year later and there is still
no in-tree user for any of this. This is not our process. Even worse it
was exported so it looks like this dead code is supporting out of tree
modules.

2) Why is this like this? Every struct device already has a connection
to the iommu layer and every mdev has a struct device all its own. Why
did we need to add special 'if (mdev)' stuff all over the place? This
smells like the same abuse Thomas and I pointed out for the interrupt
domains.

After my next series the mdev drivers will have direct access to the
vfio_device. So an alternative to using the struct device, or adding
'if mdev', is to add an API to the vfio_device world to inject what
iommu configuration is needed from that direction instead of trying to
discover it from a struct device.

3) The vfio_bus_is_mdev() and related symbol_get() nonsense in
drivers/vfio/vfio_iommu_type1.c has to go, for the same reasons it was
not acceptable to do this for the interrupt side either.
4) It seems pretty clear to me this will be heavily impacted by the
/dev/ioasid discussion. Please consider removing the dead code now.

Basically, please fix this before trying to get idxd mdev merged as the
first user.

Jason
Re: [RFC PATCH v3 06/12] iommu: Split 'addr_merge' argument to iommu_pgsize() into separate parts
On 2021-04-06 04:53, Will Deacon wrote:
> On Mon, Apr 05, 2021 at 12:11:06PM -0700, Isaac J. Manjarres wrote:
> > From: Will Deacon
> >
> > The 'addr_merge' parameter to iommu_pgsize() is a fabricated address
> > intended to describe the alignment requirements to consider when
> > choosing an appropriate page size. On the iommu_map() path, this
> > address is the logical OR of the virtual and physical addresses.
> >
> > Subsequent improvements to iommu_pgsize() will need to check the
> > alignment of the virtual and physical components of 'addr_merge'
> > independently, so pass them in as separate parameters and
> > reconstruct 'addr_merge' locally.
> >
> > No functional change.
> >
> > Signed-off-by: Will Deacon
> > Signed-off-by: Isaac J. Manjarres
> > ---
> >  drivers/iommu/iommu.c | 10 ++
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 9006397b6604..a3bbf7e310b0 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -2357,12 +2357,13 @@ phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_iova_to_phys);
> >
> > -static size_t iommu_pgsize(struct iommu_domain *domain,
> > -                           unsigned long addr_merge, size_t size)
> > +static size_t iommu_pgsize(struct iommu_domain *domain, unsigned long iova,
> > +                           phys_addr_t paddr, size_t size)
> >  {
> >  unsigned int pgsize_idx;
> >  unsigned long pgsizes;
> >  size_t pgsize;
> > + phys_addr_t addr_merge = paddr | iova;
>
> Huh, so this was 'unsigned long' before and, given that the
> pgsize_bitmap on the domain is also unsigned long, then I think that's
> fine. So using that would mean you don't need GENMASK_ULL for this guy
> either.
>
> Will

Thanks, I will address this in version 4 of the series.

--Isaac
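For reference, the selection iommu_pgsize() performs — the largest supported page size that both fits in the remaining length and matches the combined alignment of `iova | paddr` — can be reproduced in a standalone userspace sketch. GCC/Clang builtins stand in for the kernel's `__fls()`/`__ffs()`, and the bitmap value in the usage note is an example, not any particular IOMMU's pgsize_bitmap:

```c
#include <assert.h>

/* Index of the highest set bit; v must be nonzero (64-bit long). */
static int fls_ul(unsigned long v)
{
	return 63 - __builtin_clzl(v);
}

/*
 * Pick the largest page size from pgsize_bitmap that fits in 'size'
 * and respects the alignment of addr_merge = iova | paddr, mirroring
 * the logic of iommu_pgsize().
 */
static unsigned long pick_pgsize(unsigned long pgsize_bitmap,
				 unsigned long iova, unsigned long paddr,
				 unsigned long size)
{
	unsigned long addr_merge = iova | paddr;
	/* drop page sizes larger than the remaining mapping size */
	unsigned long pgsizes = pgsize_bitmap & ((2UL << fls_ul(size)) - 1);

	/* drop page sizes the combined address is not aligned to */
	if (addr_merge)
		pgsizes &= (2UL << __builtin_ctzl(addr_merge)) - 1;

	assert(pgsizes); /* the minimum supported size must always fit */
	return 1UL << fls_ul(pgsizes);
}
```

With a 4K | 2M | 1G bitmap, a 4M region whose iova and paddr are both 2M-aligned selects 2M, while nudging the iova by one 4K page drops the choice to 4K — the alignment sensitivity that motivates checking the virtual and physical components independently.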
Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
So then we have the issue of how to dynamically increase this rcache threshold. The problem is that we may have many devices associated with the same domain. So, in theory, we can't rule out that some other device will try to fast-free an IOVA which was allocated prior to the increase and was not rounded up.

I'm very open to better (or less bad) suggestions on how to do this ...

...but yes, regardless of exactly where it happens, rounding up or not is the problem for rcaches in general. I've said several times that my preferred approach is to not change it that dynamically at all, but instead treat it more like we treat the default domain type.

Can you remind me of that idea? I don't remember you mentioning using default domain handling as a reference in any context.

Hi Robin,

Sorry if the phrasing was unclear there - the allusion to default domains is new; it just occurred to me that what we do there is in fact fairly close to what I've suggested previously for this. In that case, we have a global policy set by the command line, which *can* be overridden per-domain via sysfs at runtime, provided the user is willing to tear the whole thing down. Using a similar approach here would give a fair degree of flexibility but still mean that changes never have to be made dynamically to a live domain.

So are you saying that we can handle it similarly to how we can now handle changing the default domain for an IOMMU group via sysfs? If so, that just is not practical here: this particular DMA engine provides the block device backing the / mount point, so if we unbind the driver, we lose the / mount point. And I am not sure the end user would even know how to set such a tunable - or, in this case, why the end user would not want the optimized range configured always. I'd still prefer it if the device driver could provide info which can be used to configure this before or during probing.
Cheers, John
[PATCH v5.4 1/1] iommu/vt-d: Fix agaw for a supported 48 bit guest address width
The IOMMU driver calculates the guest addressability for a DMA request based on the value of the mgaw reported by the IOMMU. However, this is a fused value and, as mentioned in the spec, the guest width should be calculated based on the supported adjusted guest address width (SAGAW). From the specification: "Guest addressability for a given DMA request is limited to the minimum of the value reported through this field and the adjusted guest address width of the corresponding page-table structure. (Adjusted guest address widths supported by hardware are reported through the SAGAW field)."

This causes domain initialization to fail, and the following errors appear for the EHCI PCI driver:

[2.486393] ehci-pci :01:00.4: EHCI Host Controller
[2.486624] ehci-pci :01:00.4: new USB bus registered, assigned bus number 1
[2.489127] ehci-pci :01:00.4: DMAR: Allocating domain failed
[2.489350] ehci-pci :01:00.4: DMAR: 32bit DMA uses non-identity mapping
[2.489359] ehci-pci :01:00.4: can't setup: -12
[2.489531] ehci-pci :01:00.4: USB bus 1 deregistered
[2.490023] ehci-pci :01:00.4: init :01:00.4 fail, -12
[2.490358] ehci-pci: probe of :01:00.4 failed with error -12

This issue happens when the value of the sagaw corresponds to a 48-bit agaw. This fix updates the calculation of the agaw based on the IOMMU's sagaw value.
Signed-off-by: Saeed Mirzamohammadi
Tested-by: Camille Lu
---
 drivers/iommu/intel-iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 953d86ca6d2b..396e14fad54b 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1867,8 +1867,8 @@ static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu,
 	domain_reserve_special_ranges(domain);
 
 	/* calculate AGAW */
-	if (guest_width > cap_mgaw(iommu->cap))
-		guest_width = cap_mgaw(iommu->cap);
+	if (guest_width > agaw_to_width(iommu->agaw))
+		guest_width = agaw_to_width(iommu->agaw);
 	domain->gaw = guest_width;
 	adjust_width = guestwidth_to_adjustwidth(guest_width);
 	agaw = width_to_agaw(adjust_width);
--
2.27.0
[PATCH RESEND] lib/scatterlist: Fix NULL pointer deference
When sg_alloc_table_from_pages is called with n_pages = 0, we write to a non-allocated page. Fix it by checking the error condition early.

[7.666801] BUG: kernel NULL pointer dereference, address: 0010
[7.667487] #PF: supervisor read access in kernel mode
[7.667970] #PF: error_code(0x) - not-present page
[7.668448] PGD 0 P4D 0
[7.668690] Oops: [#1] SMP NOPTI
[7.669037] CPU: 0 PID: 184 Comm: modprobe Not tainted 5.11.0+ #2
[7.669606] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[7.670378] RIP: 0010:__sg_alloc_table_from_pages+0x2c5/0x4a0
[7.670924] Code: c9 01 48 c7 40 08 00 00 00 00 48 89 08 8b 47 0c 41 8d 44 00 ff 89 47 0c 48 81 fa 00 f0 ff ff 0f 87 d4 01 00 00 49 8b 16 89 d8 <4a> 8b 74 fd 00 4c 89 d1 44 29 f8 c1 e0 0c 44 29 d8 4c 39 d0 48 0f
[7.672643] RSP: 0018:ba1e8028fb30 EFLAGS: 00010287
[7.673133] RAX: 0001 RBX: 0001 RCX: 0002
[7.673791] RDX: 0002 RSI: ada6d0ba RDI: 9afe01fff820
[7.674448] RBP: 0010 R08: 0001 R09: 0001
[7.675100] R10: R11: R12:
[7.675754] R13: f000 R14: 9afe01fff800 R15:
[7.676409] FS: 7fb0f448f540() GS:9afe07a0() knlGS:
[7.677151] CS: 0010 DS: ES: CR0: 80050033
[7.677681] CR2: 0010 CR3: 02184001 CR4: 00370ef0
[7.678342] DR0: DR1: DR2:
[7.679019] DR3: DR6: fffe0ff0 DR7: 0400
[7.680349] Call Trace:
[7.680605] ? device_add+0x146/0x810
[7.681021] sg_alloc_table_from_pages+0x11/0x30
[7.681511] vb2_dma_sg_alloc+0x162/0x280 [videobuf2_dma_sg]

Cc: sta...@vger.kernel.org
Fixes: efc42bc98058 ("scatterlist: add sg_alloc_table_from_pages function")
Signed-off-by: Ricardo Ribalda
---
 lib/scatterlist.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index a59778946404..1e83b6a3d930 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -435,6 +435,9 @@ struct scatterlist *__sg_alloc_table_from_pages(struct sg_table *sgt,
 	unsigned int added_nents = 0;
 	struct scatterlist *s = prv;
 
+	if (n_pages == 0)
+		return ERR_PTR(-EINVAL);
+
 	/*
 	 * The algorithm below requires max_segment to be aligned to PAGE_SIZE
 	 * otherwise it can overshoot.
--
2.31.0.208.g409f899ff0-goog
Re: [PATCH v14 09/10] ACPI/IORT: Enable stall support for platform devices
On Thu, Apr 01, 2021 at 05:47:18PM +0200, Jean-Philippe Brucker wrote: > Copy the "Stall supported" bit, that tells whether a named component > supports stall, into the dma-can-stall device property. > > Acked-by: Jonathan Cameron > Signed-off-by: Jean-Philippe Brucker > --- > drivers/acpi/arm64/iort.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) Acked-by: Lorenzo Pieralisi > diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c > index 3912a1f6058e..0828f70cb782 100644 > --- a/drivers/acpi/arm64/iort.c > +++ b/drivers/acpi/arm64/iort.c > @@ -968,13 +968,15 @@ static int iort_pci_iommu_init(struct pci_dev *pdev, > u16 alias, void *data) > static void iort_named_component_init(struct device *dev, > struct acpi_iort_node *node) > { > - struct property_entry props[2] = {}; > + struct property_entry props[3] = {}; > struct acpi_iort_named_component *nc; > > nc = (struct acpi_iort_named_component *)node->node_data; > props[0] = PROPERTY_ENTRY_U32("pasid-num-bits", > FIELD_GET(ACPI_IORT_NC_PASID_BITS, > nc->node_flags)); > + if (nc->node_flags & ACPI_IORT_NC_STALL_SUPPORTED) > + props[1] = PROPERTY_ENTRY_BOOL("dma-can-stall"); > > if (device_add_properties(dev, props)) > dev_warn(dev, "Could not add device properties\n"); > -- > 2.31.1 > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, Apr 06, 2021 at 09:35:17AM +0800, Jason Wang wrote:
> > VFIO and VDPA have no business having map/unmap interfaces once we have
> > /dev/ioasid. That all belongs in the ioasid side.
> >
> > I know they have those interfaces today, but that doesn't mean we have
> > to keep using them for PASID use cases, they should be replaced with a
> > 'do dma from this pasid on /dev/ioasid' interface, certainly not a
> > 'here is a pasid from /dev/ioasid, go ahead and configure it yourself'
> > interface
>
> So it looks like the PASID was bound to SVA in this design. I think it's not
> necessarily the case:

No, I wish people would stop talking about SVA. SVA and vSVA are a very special, narrow configuration of a PASID. There are lots of other PASID configurations! That is the whole point: a PASID is complicated, there are many configuration scenarios, and they need to be in one place with a very clearly defined uAPI.

> 1) PASID can be implemented without SVA, in this case a map/unmap interface
> is still required

Any interface to manipulate a PASID should be under /dev/ioasid. We do not want to duplicate this into every subsystem.

> 2) For the case that the hypervisor wants to do some mediation in the middle for
> a virtqueue. e.g. in the case of a control vq that is implemented in the
> VF/ADI/SF itself, the hardware virtqueue needs to be controlled by Qemu.
> Though binding qemu's page table to the cvq can work, it looks like
> overkill; a small dedicated buffer that is mapped for this PASID seems more
> suitable.

/dev/ioasid should allow userspace to set up any PASID configuration it wants. There are many choices. That is the whole point: instead of copying and pasting all the PASID configuration options into every subsystem, we have one place to configure it.

If you want a PASID (or generic ioasid) that has the guest physical map, which is probably all that VDPA would ever want, then /dev/ioasid should be able to prepare that.
If you just want to map a few buffers into a PASID then it should be able to do that too.

> So do you mean the device should not expose the PASID configuration API to
> the guest? I think it could happen if we assign the whole device and let the
> guest configure it for nested VMs.

This always needs co-operation with the vIOMMU driver. We can't have nested PASID use without both parts working together.

The vIOMMU driver configures the PASID and assigns the mappings (however complicated that turns out to be). The VDPA/mdev driver authorizes the HW to use the ioasid mapping, e.g. by authorizing a queue to issue PCIe TLPs with a specific PASID.

The authorization is triggered by the guest telling the vIOMMU to allow a vRID to talk to a PASID, which qemu would have to translate into telling something like the VDPA driver under the vRID that it can use a PASID from /dev/ioasid.

For security, a VDPA/mdev device MUST NOT issue PASIDs that the vIOMMU has not authorized its vRID to use. Otherwise the security model of something like VFIO in the guest becomes completely broken.

Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, Apr 06, 2021 at 01:27:15AM +, Tian, Kevin wrote:
> and here is one example why using the existing VFIO/VDPA interface makes
> sense. say dev1 (w/ sva) and dev2 (w/o sva) are placed in a single VFIO
> container.

Forget about SVA, it is an irrelevant detail of how a PASID is configured.

> The container is associated to an iommu domain which contains a
> single 2nd-level page table, shared by both devices (when attached
> to the domain).

This level should be described by an ioasid.

> The VFIO MAP operation is applied to the 2nd-level
> page table thus naturally applied to both devices. Then userspace
> could use /dev/ioasid to further allocate IOASIDs and bind multiple
> 1st-level page tables for dev1, nested on the shared 2nd-level page
> table.

Because if you don't then we enter an insane world where a PASID is being created under /dev/ioasid but its translation path flows through setup done by VFIO, and the whole user API becomes an incomprehensible mess. How will you even associate the PASID with the other translation?

The entire translation path for any ioasid or PASID should be defined only by /dev/ioasid. Everything else is a legacy API.

> If following your suggestion then VFIO must deny VFIO MAP operations
> on sva1 (assume userspace should not mix sva1 and sva2 in the same
> container and instead use /dev/ioasid to map for sva1)?

No, userspace creates an ioasid for the guest physical mapping and passes this ioasid to VFIO PCI, which will assign it as the first-layer mapping on the RID. When PASIDs are allocated, the uAPI will be told to logically nest them under the first ioasid.

When VFIO authorizes a PASID for a RID it checks that all the HW rules are being followed. If there are rules like groups of VFIO devices must always use the same IOASID then VFIO will check these too (and realistically qemu will have only one guest physical map ioasid anyhow).

There is no real difference between setting up an IOMMU table for a (RID,PASID) tuple or just a RID.
We can do it universally with one interface for all consumers. I wanted this when we were doing VDPA for the first time; now that we are doing PASID and more difficult stuff, I view it as essential.

Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, Apr 06, 2021 at 01:02:05AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Tuesday, April 6, 2021 7:40 AM
> >
> > On Fri, Apr 02, 2021 at 07:58:02AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe
> > > > Sent: Thursday, April 1, 2021 9:47 PM
> > > >
> > > > On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > > > > From: Jason Gunthorpe
> > > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > > >
> > > > > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > > > > From: Jason Gunthorpe
> > > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > > [...]
> > > > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > > > ENQCMD, but that is not consistent with the industry. We need to see
> > > > > > > > normal nested PASID support with assigned PCI VFs.
> > > > > > >
> > > > > > > I'm not quite following here. Intel also allows PASID usage in a guest
> > > > > > > without ENQCMD. e.g. Passthru a PF to a guest, and use PASID on it
> > > > > > > without ENQCMD.
> > > > > >
> > > > > > Then you need all the parts, the hypervisor calls from the vIOMMU, and
> > > > > > you can't really use a vPASID.
> > > > >
> > > > > This is a diagram that shows the vSVA setup.
> > > >
> > > > I'm not talking only about vSVA. Generic PASID support with arbitrary
> > > > mappings.
> > > >
> > > > And how do you deal with the vPASID vs pPASID issue if the system has
> > > > a mix of physical devices and mdevs?
> > >
> > > We plan to support two schemes. One is vPASID identity-mapped to
> > > pPASID, then the mixed scenario just works, with the limitation of
> > > lacking live migration support. The other is a non-identity-mapped
> > > scheme, where live migration is supported but physical devices and
> > > mdevs should not be mixed in one VM if both expose SVA capability
> > > (requires some filtering check in Qemu).
> > That just becomes "block vPASID support if any device that
> > doesn't use ENQCMD is plugged into the guest"
>
> The limitation is only for physical devices, and in reality it is not that
> bad. To support live migration with a physical device we anyway need
> additional work to migrate the device state (e.g. based on Max's work),
> then it's not unreasonable to also mediate guest programming of
> the device-specific PASID register to enable vPASID (need to translate for
> the whole VM lifespan but likely is not a hot path).

IMHO that is pretty unreasonable.. More likely we end up with vPASID tables in each migratable device like KVM has.

> > Which needs a special VFIO capability of some kind so qemu knows to
> > block it. This really needs to all be laid out together so someone
> > can understand it :(
>
> Or it could simply be based on whether the VFIO device supports live migration.

You need to define affirmative caps that indicate that vPASID will be supported by the VFIO device.

> > Why doesn't the siov cookbook explain this stuff??
>
> > > We hope the /dev/ioasid can support both schemes, with the minimal
> > > requirement of allowing userspace to tag a vPASID to a pPASID and
> > > allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> > > the guest will always use pPASID.
> >
> > What I'm unclear about is if /dev/ioasid even needs to care about
> > vPASID or if vPASID is just a hidden artifact of the KVM connection to
> > set up the translation table and the vIOMMU driver in qemu.
>
> Not just for KVM. Also required by mdev, which needs to translate
> vPASID into pPASID when ENQCMD is not used.

Do we have any mdevs that will do this?

> should only care about the operations related to pPASID. VFIO could
> carry vPASID information to mdev.

It depends how common this is, I suppose

Jason
Re: [RFC PATCH v3 11/12] iommu/arm-smmu: Implement the unmap_pages() IOMMU driver callback
On Mon, Apr 05, 2021 at 12:11:11PM -0700, Isaac J. Manjarres wrote: > Implement the unmap_pages() callback for the ARM SMMU driver > to allow calls from iommu_unmap to unmap multiple pages of > the same size in one call. > > Signed-off-by: Isaac J. Manjarres > Suggested-by: Will Deacon > --- > drivers/iommu/arm/arm-smmu/arm-smmu.c | 19 +++ > 1 file changed, 19 insertions(+) > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c > b/drivers/iommu/arm/arm-smmu/arm-smmu.c > index d8c6bfde6a61..f29f1fb109f8 100644 > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c > @@ -1225,6 +1225,24 @@ static size_t arm_smmu_unmap(struct iommu_domain > *domain, unsigned long iova, > return ret; > } > > +static size_t arm_smmu_unmap_pages(struct iommu_domain *domain, unsigned > long iova, > +size_t pgsize, size_t pgcount, > +struct iommu_iotlb_gather *iotlb_gather) > +{ > + struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops; > + struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu; > + size_t ret; > + > + if (!ops) > + return 0; > + > + arm_smmu_rpm_get(smmu); > + ret = ops->unmap_pages(ops, iova, pgsize, pgcount, iotlb_gather); > + arm_smmu_rpm_put(smmu); > + > + return ret; > +} Doesn't this go wrong if we're using the short-descriptor page-table format? (same for the next patch) Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v3 09/12] iommu/io-pgtable-arm: Implement arm_lpae_unmap_pages()
On Mon, Apr 05, 2021 at 12:11:09PM -0700, Isaac J. Manjarres wrote: > Implement the unmap_pages() callback for the ARM LPAE io-pgtable > format. > > Signed-off-by: Isaac J. Manjarres > Suggested-by: Will Deacon > --- > drivers/iommu/io-pgtable-arm.c | 124 +++-- > 1 file changed, 104 insertions(+), 20 deletions(-) [...] > +static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long > iova, > +size_t pgsize, size_t pgcount, > +struct iommu_iotlb_gather *gather) > +{ > + struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops); > + struct io_pgtable_cfg *cfg = &data->iop.cfg; > + arm_lpae_iopte *ptep = data->pgd; > + long iaext = (s64)iova >> cfg->ias; > + size_t unmapped = 0, unmapped_page; > + int last_lvl; > + size_t table_size, pages, tbl_offset, max_entries; > + > + if (WARN_ON(!pgsize || (pgsize & cfg->pgsize_bitmap) != pgsize || > !pgcount)) > + return 0; > + > + if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1) > + iaext = ~iaext; > + if (WARN_ON(iaext)) > + return 0; > + > + /* > + * Calculating the page table size here helps avoid situations where > + * a page range that is being unmapped may be mapped at the same level > + * but not mapped by the same tables. Allowing such a scenario to > + * occur can complicate the logic in arm_lpae_split_blk_unmap(). 
> + */ > + last_lvl = ARM_LPAE_BLOCK_SIZE_LVL(pgsize, data); > + > + if (last_lvl == data->start_level) > + table_size = ARM_LPAE_PGD_SIZE(data); > + else > + table_size = ARM_LPAE_GRANULE(data); > + > + max_entries = table_size / sizeof(*ptep); > + > + while (pgcount) { > + tbl_offset = ARM_LPAE_LVL_IDX(iova, last_lvl, data); > + pages = min_t(size_t, pgcount, max_entries - tbl_offset); > + unmapped_page = __arm_lpae_unmap(data, gather, iova, pgsize, > + pages, data->start_level, > + ptep); > + if (!unmapped_page) > + break; > + > + unmapped += unmapped_page; > + iova += unmapped_page; > + pgcount -= pages; > + } Robin has some comments on the first version of this patch, and I don't think you addressed them: https://lore.kernel.org/r/b93fa0b1-e2a4-1aad-8b88-4d0dfecdf...@arm.com I'm inclined to agree that iterating here doesn't make a lot of sense -- we've already come back out of the page-table walk, so I think we should just return to the caller (who is well prepared to handle a partial unmap). Same for the map side of things. If we get numbers showing that this is causing a performance issue, then we should rework the page-table code to handle this at the lower level (because I doubt the loop that you have is really worse than returning to the caller anyway). Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, Apr 06, 2021 at 12:37:35AM +, Tian, Kevin wrote:
> With nested translation it is GVA->GPA->HPA. The kernel needs to
> fix faults related to GPA->HPA (managed by VFIO/VDPA) while
> handle_mm_fault only handles HVA->HPA. In this case, the 2nd-level
> page fault is expected to be delivered to VFIO/VDPA first, which then
> finds the HVA related to the GPA, calls handle_mm_fault to fix HVA->HPA,
> and then calls iommu_map to fix GPA->HPA in the IOMMU page table.
> This is exactly like how a CPU EPT violation is handled.

No, it should all be in the /dev/ioasid layer, not duplicated into every user.

> > If the fault needs to be fixed in the guest, then it needs to be
> > delivered over /dev/ioasid in some way and injected into the
> > vIOMMU. VFIO and VDPA have nothing to do with the vIOMMU driver in qemu.
> >
> > You need to have an interface under /dev/ioasid to create both page
> > table levels and part of that will be to tell the kernel what VA is
> > mapped and how to handle faults.
>
> VFIO/VDPA already have their own interface to manage GPA->HPA
> mappings. Why do we want to duplicate it in /dev/ioasid?

They have their own interface to manage other types of HW; we should not duplicate PASID programming into there too.

Jason
Re: [RFC PATCH v3 04/12] iommu: Add a map_pages() op for IOMMU drivers
On Mon, Apr 05, 2021 at 12:11:04PM -0700, Isaac J. Manjarres wrote:
> Add a callback for IOMMU drivers to provide a path for the
> IOMMU framework to call into an IOMMU driver, which can
> call into the io-pgtable code, to map a physically contiguous
> range of pages of the same size.
>
> For IOMMU drivers that do not specify a map_pages() callback,
> the existing logic of mapping memory one page block at a time
> will be used.
>
> Signed-off-by: Isaac J. Manjarres
> Suggested-by: Will Deacon
> ---
> include/linux/iommu.h | 5 +
> 1 file changed, 5 insertions(+)
>
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 9cf81242581a..528d6a58479e 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -192,6 +192,8 @@ struct iommu_iotlb_gather {
>  * @attach_dev: attach device to an iommu domain
>  * @detach_dev: detach device from an iommu domain
>  * @map: map a physically contiguous memory region to an iommu domain
> + * @map_pages: map a physically contiguous set of pages of the same size to
> + *             an iommu domain.
>  * @unmap: unmap a physically contiguous memory region from an iommu domain
>  * @unmap_pages: unmap a number of pages of the same size from an iommu domain
>  * @flush_iotlb_all: Synchronously flush all hardware TLBs for this domain
> @@ -244,6 +246,9 @@ struct iommu_ops {
>  	void (*detach_dev)(struct iommu_domain *domain, struct device *dev);
>  	int (*map)(struct iommu_domain *domain, unsigned long iova,
>  		   phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
> +	int (*map_pages)(struct iommu_domain *domain, unsigned long iova,
> +			 phys_addr_t paddr, size_t pgsize, size_t pgcount,
> +			 int prot, gfp_t gfp, size_t *mapped);

(same comment as for the io-pgtable callback).

Will
Re: [RFC PATCH v3 03/12] iommu/io-pgtable: Introduce map_pages() as a page table op
On Mon, Apr 05, 2021 at 12:11:03PM -0700, Isaac J. Manjarres wrote: > Mapping memory into io-pgtables follows the same semantics > that unmapping memory used to follow (i.e. a buffer will be > mapped one page block per call to the io-pgtable code). This > means that it can be optimized in the same way that unmapping > memory was, so add a map_pages() callback to the io-pgtable > ops structure, so that a range of pages of the same size > can be mapped within the same call. > > Signed-off-by: Isaac J. Manjarres > Suggested-by: Will Deacon > --- > include/linux/io-pgtable.h | 4 > 1 file changed, 4 insertions(+) > > diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h > index 2ed0c057d9e7..019149b204b8 100644 > --- a/include/linux/io-pgtable.h > +++ b/include/linux/io-pgtable.h > @@ -143,6 +143,7 @@ struct io_pgtable_cfg { > * struct io_pgtable_ops - Page table manipulation API for IOMMU drivers. > * > * @map: Map a physically contiguous memory region. > + * @map_pages:Map a physically contiguous range of pages of the same > size. > * @unmap:Unmap a physically contiguous memory region. > * @unmap_pages: Unmap a range of virtually contiguous pages of the same > size. > * @iova_to_phys: Translate iova to physical address. > @@ -153,6 +154,9 @@ struct io_pgtable_cfg { > struct io_pgtable_ops { > int (*map)(struct io_pgtable_ops *ops, unsigned long iova, > phys_addr_t paddr, size_t size, int prot, gfp_t gfp); > + int (*map_pages)(struct io_pgtable_ops *ops, unsigned long iova, > + phys_addr_t paddr, size_t pgsize, size_t pgcount, > + int prot, gfp_t gfp, size_t *mapped); How about returning 'size_t' and using IS_ERR_VALUE() instead of adding the extra 'mapped' argument (i.e. return the size of the region mapped or an error code)? I don't think we realistically need to care about map sizes that overlap with the error region. Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v3 06/12] iommu: Split 'addr_merge' argument to iommu_pgsize() into separate parts
On Mon, Apr 05, 2021 at 12:11:06PM -0700, Isaac J. Manjarres wrote: > From: Will Deacon > > The 'addr_merge' parameter to iommu_pgsize() is a fabricated address > intended to describe the alignment requirements to consider when > choosing an appropriate page size. On the iommu_map() path, this address > is the logical OR of the virtual and physical addresses. > > Subsequent improvements to iommu_pgsize() will need to check the > alignment of the virtual and physical components of 'addr_merge' > independently, so pass them in as separate parameters and reconstruct > 'addr_merge' locally. > > No functional change. > > Signed-off-by: Will Deacon > Signed-off-by: Isaac J. Manjarres > --- > drivers/iommu/iommu.c | 10 ++ > 1 file changed, 6 insertions(+), 4 deletions(-) > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index 9006397b6604..a3bbf7e310b0 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -2357,12 +2357,13 @@ phys_addr_t iommu_iova_to_phys(struct iommu_domain > *domain, dma_addr_t iova) > } > EXPORT_SYMBOL_GPL(iommu_iova_to_phys); > > -static size_t iommu_pgsize(struct iommu_domain *domain, > -unsigned long addr_merge, size_t size) > +static size_t iommu_pgsize(struct iommu_domain *domain, unsigned long iova, > +phys_addr_t paddr, size_t size) > { > unsigned int pgsize_idx; > unsigned long pgsizes; > size_t pgsize; > + phys_addr_t addr_merge = paddr | iova; Huh, so this was 'unsigned long' before and, given that the pgsize_bitmap on the domain is also unsigned long, then I think that's fine. So using that would mean you don't need GENMASK_ULL for this guy either. Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v3 05/12] iommu: Use bitmap to calculate page size in iommu_pgsize()
On Mon, Apr 05, 2021 at 12:11:05PM -0700, Isaac J. Manjarres wrote: > From: Will Deacon > > Avoid the potential for shifting values by amounts greater than the > width of their type by using a bitmap to compute page size in > iommu_pgsize(). > > Signed-off-by: Will Deacon > Signed-off-by: Isaac J. Manjarres > --- > drivers/iommu/iommu.c | 31 --- > 1 file changed, 12 insertions(+), 19 deletions(-) > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index d0b0a15dba84..9006397b6604 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -8,6 +8,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -2360,30 +2361,22 @@ static size_t iommu_pgsize(struct iommu_domain > *domain, > unsigned long addr_merge, size_t size) > { > unsigned int pgsize_idx; > + unsigned long pgsizes; > size_t pgsize; > > - /* Max page size that still fits into 'size' */ > - pgsize_idx = __fls(size); > + /* Page sizes supported by the hardware and small enough for @size */ > + pgsizes = domain->pgsize_bitmap & GENMASK_ULL(__fls(size), 0); See my comments on the other thread, but I don't think it's necessary to use the _ULL versions everywhere. Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH 3/6] iommu: Use bitmap to calculate page size in iommu_pgsize()
On Thu, Apr 01, 2021 at 06:39:35PM -0700, isa...@codeaurora.org wrote:
> On 2021-04-01 09:47, Will Deacon wrote:
> > Avoid the potential for shifting values by amounts greater than the
> > width of their type by using a bitmap to compute page size in
> > iommu_pgsize().
> >
> > Signed-off-by: Will Deacon
> > ---
> >  drivers/iommu/iommu.c | 31 ---
> >  1 file changed, 12 insertions(+), 19 deletions(-)
> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index d0b0a15dba84..bcd623862bf9 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -8,6 +8,7 @@
> >
> >  #include
> >  #include
> > +#include
> >  #include
> >  #include
> >  #include
> > @@ -2360,30 +2361,22 @@ static size_t iommu_pgsize(struct iommu_domain *domain,
> >  			   unsigned long addr_merge, size_t size)
> >  {
> >  	unsigned int pgsize_idx;
> > +	unsigned long pgsizes;
> >  	size_t pgsize;
> >
> > -	/* Max page size that still fits into 'size' */
> > -	pgsize_idx = __fls(size);
> > +	/* Page sizes supported by the hardware and small enough for @size */
> > +	pgsizes = domain->pgsize_bitmap & GENMASK(__fls(size), 0);
>
> I've fixed this in the latest RFC for the iommu_map/unmap optimization
> patches, but for the sake of completeness: I think this should be
> GENMASK_ULL, in case __fls(size) >= 32.

Hmm, but 'size' is a size_t; which architectures have
sizeof(size_t) > sizeof(unsigned long)?

> > -	/* need to consider alignment requirements ? */
> > -	if (likely(addr_merge)) {
> > -		/* Max page size allowed by address */
> > -		unsigned int align_pgsize_idx = __ffs(addr_merge);
> > -		pgsize_idx = min(pgsize_idx, align_pgsize_idx);
> > -	}
> > -
> > -	/* build a mask of acceptable page sizes */
> > -	pgsize = (1UL << (pgsize_idx + 1)) - 1;
> > -
> > -	/* throw away page sizes not supported by the hardware */
> > -	pgsize &= domain->pgsize_bitmap;
> > +	/* Constrain the page sizes further based on the maximum alignment */
> > +	if (likely(addr_merge))
> > +		pgsizes &= GENMASK(__ffs(addr_merge), 0);

This one looks like more of an issue, though, as addr_merge is a
phys_addr_t, which certainly can be 64-bit where unsigned long is
32-bit (e.g. Armv7 + LPAE).

Rather than make everything _ULL, please can you just do it where we're
actually using the larger types?

Will
Re: [PATCH] iommu: Add device name to iommu map/unmap trace events
On Tue, Apr 06, 2021 at 02:56:53PM +0800, chenxiang (M) wrote:
> Is it possible to use group id to identify different domains?

No, the best is to do this during post-processing of your traces by also
keeping track of domain-device attachments/detachments.

Regards,

Joerg
Re: [PATCH v2 0/4] iommu/iova: Add CPU hotplug handler to flush rcaches to core code
On 25/03/2021 17:53, Will Deacon wrote:
> On Thu, Mar 25, 2021 at 08:29:57PM +0800, John Garry wrote:
> > The Intel IOMMU driver supports flushing the per-CPU rcaches when a
> > CPU is offlined. Let's move it to core code, so everyone can take
> > advantage.
> >
> > Also throw in a patch to stop exporting free_iova_fast().
> >
> > Differences to v1:
> > - Add RB tags (thanks!)
> > - Add patch to stop exporting free_iova_fast()
> > - Drop patch to correct comment
> > - Add patch to delete iommu_dma_free_cpu_cached_iovas() and
> >   associated changes
> >
> > John Garry (4):
> >   iova: Add CPU hotplug handler to flush rcaches
> >   iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining
> >   iommu: Delete iommu_dma_free_cpu_cached_iovas()
> >   iommu: Stop exporting free_iova_fast()
>
> Looks like this is all set for 5.13, so hopefully Joerg can stick it in
> -next for a bit more exposure.

Hi Joerg,

Can you kindly consider picking this up now?

Thanks,
John
Re: [PATCH v14 00/10] iommu: I/O page faults for SMMUv3
On Thu, Apr 01, 2021 at 06:15:02PM +0100, Will Deacon wrote:
> On Thu, Apr 01, 2021 at 05:47:09PM +0200, Jean-Philippe Brucker wrote:
> > Add stall support to the SMMUv3 driver, along with a common I/O Page
> > Fault handler.
> >
> > Since [v13] I added review and ack tags (Thanks!), and a
> > lockdep_assert. It would be good to have all of it in v5.13, since
> > patch 10 introduces the first user for the IOPF interface from
> > patch 6. But if that's not possible, please pick patches 1-6 so the
> > VT-d driver can start using them.
>
> Patches 1-7 look good to me, but I'm not convinced about the utility
> of stalling faults so I'd prefer the later patches to come along with
> a real user.

As others said, it is possible to assign queues from the compression and
crypto accelerators on the Kunpeng920 to userspace, using the uacce char
device (upstream since last year, but waiting for implementations of the
SVA API in IOMMU drivers). I've been using that platform for testing my
code for the past year, with the UADK tool as well as an openssl plugin.

Securely assigning a queue to userspace requires full SVA support in
SMMUv3, which consists of PASID, page table sharing, and I/O page
faults. The first two were already merged, and the third one requires
either Stall or PRI. I'm not submitting PRI support at the moment
because there is no hardware, but the Hisilicon platform implements
stall and will be able to use it right away.

Thanks,
Jean