Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map
On 2023-09-25 14:29, Jason Gunthorpe wrote:
On Mon, Sep 25, 2023 at 02:07:50PM +0100, Robin Murphy wrote:
On 2023-09-23 00:33, Jason Gunthorpe wrote:
On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:

virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings either; it sets it once it's discovered any instance, since apparently it's assuming that all instances must support identical page sizes, and thus once it's seen one it can work "normally" per the core code's assumptions. It's also I think the only driver which has a "finalise" bodge but *can* still properly support map-before-attach, by virtue of having to replay mappings to every new endpoint anyway.

Well it can't quite do that since it doesn't know the geometry - it all is sort of guessing and hoping it doesn't explode on replay. If it knows the geometry it wouldn't need finalize...

I think it's entirely reasonable to assume that any direct mappings specified for a device are valid for that device and its IOMMU. However, in the particular case of virtio, it really shouldn't ever have direct mappings anyway, since even if the underlying hardware did have any, the host can enforce the actual direct-mapping aspect itself, and just present them as unusable regions to the guest.

I assume this machinery is for the ARM GIC ITS page

Again, that's irrelevant. It can only be about whether the actual ->map_pages call succeeds or not. A driver could well know up-front that all instances support the same pgsize_bitmap and aperture, and set both at ->domain_alloc time, yet still be unable to handle an actual mapping without knowing which instance(s) it needs to interact with (e.g. omap-iommu).

I think this is a different issue. The domain is supposed to represent the actual io pte storage, and the storage is supposed to exist even when the domain is not attached to anything.
As we said with tegra-gart, it is a bug in the driver if all the mappings disappear when the last device is detached from the domain. Driver bugs like this turn into significant issues with vfio/iommufd, as this will result in WARN_ONs and memory leaking. So, I disagree that this is something we should be allowing in the API design. map_pages should succeed (memory allocation failures aside) if an IOVA within the aperture and valid flags are presented, regardless of the attachment status. Calling map_pages with an IOVA outside the aperture should be a caller bug. It looks omap is just mis-designed to store the pgd in the omap_iommu, not the omap_iommu_domain :( pgd is clearly a per-domain object in our API.

And why does every instance need its own copy of the identical pgd? The point wasn't that it was necessarily a good and justifiable example, just that it is one that exists, to demonstrate that in general we have no reasonable heuristic for guessing whether ->map_pages is going to succeed or not other than by calling it and seeing if it succeeds or not. And IMO it's a complete waste of time thinking about ways to make such a heuristic possible instead of just getting on with fixing iommu_domain_alloc() to make the problem disappear altogether.

Once Joerg pushes out the current queue I'll rebase and resend v4 of the bus ops removal, then hopefully get back to despairing at the hideous pile of WIP iommu_domain_alloc() patches I currently have on top of it...

Thanks,
Robin.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map
On 2023-09-23 00:33, Jason Gunthorpe wrote:
On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:

virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings either; it sets it once it's discovered any instance, since apparently it's assuming that all instances must support identical page sizes, and thus once it's seen one it can work "normally" per the core code's assumptions. It's also I think the only driver which has a "finalise" bodge but *can* still properly support map-before-attach, by virtue of having to replay mappings to every new endpoint anyway.

Well it can't quite do that since it doesn't know the geometry - it all is sort of guessing and hoping it doesn't explode on replay. If it knows the geometry it wouldn't need finalize...

I think it's entirely reasonable to assume that any direct mappings specified for a device are valid for that device and its IOMMU. However, in the particular case of virtio, it really shouldn't ever have direct mappings anyway, since even if the underlying hardware did have any, the host can enforce the actual direct-mapping aspect itself, and just present them as unusable regions to the guest.

What do you think about something like this to replace iommu_create_device_direct_mappings(), that does enforce things properly?

I fail to see how that would make any practical difference. Either the mappings can be correctly set up in a pagetable *before* the relevant device is attached to that pagetable, or they can't (if the driver doesn't have enough information to be able to do so) and we just have to really hope nothing blows up in the race window between attaching the device to an empty pagetable and having a second try at iommu_create_device_direct_mappings(). That's a driver-level issue and has nothing to do with pgsize_bitmap either way.

Except we don't detect this in the core code correctly, that is my point.
We should detect the aperture conflict, not use pgsize_bitmap, to check whether it is the first or second try.

Again, that's irrelevant. It can only be about whether the actual ->map_pages call succeeds or not. A driver could well know up-front that all instances support the same pgsize_bitmap and aperture, and set both at ->domain_alloc time, yet still be unable to handle an actual mapping without knowing which instance(s) it needs to interact with (e.g. omap-iommu).

Thanks,
Robin.
Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map
On 22/09/2023 5:27 pm, Jason Gunthorpe wrote:
On Fri, Sep 22, 2023 at 02:13:18PM +0100, Robin Murphy wrote:
On 22/09/2023 1:41 pm, Jason Gunthorpe wrote:
On Fri, Sep 22, 2023 at 08:57:19AM +0100, Jean-Philippe Brucker wrote:

They're not strictly equivalent: this check works around a temporary issue with the IOMMU core, which calls map/unmap before the domain is finalized.

Where? The above points to iommu_create_device_direct_mappings() but it doesn't because the pgsize_bitmap == 0:

__iommu_domain_alloc() sets pgsize_bitmap in this case:

	/*
	 * If not already set, assume all sizes by default; the driver
	 * may override this later
	 */
	if (!domain->pgsize_bitmap)
		domain->pgsize_bitmap = bus->iommu_ops->pgsize_bitmap;

Drivers shouldn't do that. The core code was fixed to try again with mapping reserved regions to support these kinds of drivers.

This is still the "normal" code path, really; I think it's only AMD that started initialising the domain bitmap "early" and warranted making it conditional.

My main point was that iommu_create_device_direct_mappings() should fail for unfinalized domains; setting pgsize_bitmap to allow it to succeed is not a nice hack, and not necessary now.

Sure, but it's the whole "unfinalised domains" and rewriting domain->pgsize_bitmap after attach thing that is itself the massive hack. AMD doesn't do that, and doesn't need to; it knows the appropriate format at allocation time and can quite happily return a fully working domain which allows map before attach, but the old ops->pgsize_bitmap mechanism fundamentally doesn't work for multiple formats with different page sizes. The only thing I'd accuse it of doing wrong is the weird half-and-half thing of having one format as a default via one mechanism, and the other as an override through the other, rather than setting both explicitly.
virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings either; it sets it once it's discovered any instance, since apparently it's assuming that all instances must support identical page sizes, and thus once it's seen one it can work "normally" per the core code's assumptions. It's also I think the only driver which has a "finalise" bodge but *can* still properly support map-before-attach, by virtue of having to replay mappings to every new endpoint anyway.

What do you think about something like this to replace iommu_create_device_direct_mappings(), that does enforce things properly?

I fail to see how that would make any practical difference. Either the mappings can be correctly set up in a pagetable *before* the relevant device is attached to that pagetable, or they can't (if the driver doesn't have enough information to be able to do so) and we just have to really hope nothing blows up in the race window between attaching the device to an empty pagetable and having a second try at iommu_create_device_direct_mappings(). That's a driver-level issue and has nothing to do with pgsize_bitmap either way.

Thanks,
Robin.
Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map
On 22/09/2023 1:41 pm, Jason Gunthorpe wrote:
On Fri, Sep 22, 2023 at 08:57:19AM +0100, Jean-Philippe Brucker wrote:

They're not strictly equivalent: this check works around a temporary issue with the IOMMU core, which calls map/unmap before the domain is finalized.

Where? The above points to iommu_create_device_direct_mappings() but it doesn't because the pgsize_bitmap == 0:

__iommu_domain_alloc() sets pgsize_bitmap in this case:

	/*
	 * If not already set, assume all sizes by default; the driver
	 * may override this later
	 */
	if (!domain->pgsize_bitmap)
		domain->pgsize_bitmap = bus->iommu_ops->pgsize_bitmap;

Drivers shouldn't do that. The core code was fixed to try again with mapping reserved regions to support these kinds of drivers.

This is still the "normal" code path, really; I think it's only AMD that started initialising the domain bitmap "early" and warranted making it conditional. However we *do* ultimately want all the drivers to do the same, so we can get rid of ops->pgsize_bitmap, because it's already pretty redundant and meaningless in the face of per-domain pagetable formats.

Thanks,
Robin.
Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map
On 2023-09-19 09:15, Jean-Philippe Brucker wrote:
On Mon, Sep 18, 2023 at 05:37:47PM +0100, Robin Murphy wrote:

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 17dcd826f5c2..3649586f0e5c 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -189,6 +189,12 @@ static int viommu_sync_req(struct viommu_dev *viommu)
 	int ret;
 	unsigned long flags;
 
+	/*
+	 * .iotlb_sync_map and .flush_iotlb_all may be called before the viommu
+	 * is initialized e.g. via iommu_create_device_direct_mappings()
+	 */
+	if (!viommu)
+		return 0;

Minor nit: I'd be inclined to make that check explicit in the places where it definitely is expected, rather than allowing *any* sync to silently do nothing if called incorrectly. Plus then they could use vdomain->nr_endpoints for consistency with the equivalent checks elsewhere (it did take me a moment to figure out how we could get to .iotlb_sync_map with a NULL viommu without viommu_map_pages() blowing up first...)

They're not strictly equivalent: this check works around a temporary issue with the IOMMU core, which calls map/unmap before the domain is finalized. Once we merge domain_alloc() and finalize(), then this check disappears, but we still need to test nr_endpoints in map/unmap to handle detached domains (and we still need to fix the synchronization of nr_endpoints against attach/detach). That's why I preferred doing this on viommu and keeping it in one place.

Fair enough - it just seems to me that in both cases it's a detached domain, so its previous history of whether it's ever been otherwise or not shouldn't matter. Even once viommu is initialised, does it really make sense to send sync commands for a mapping on a detached domain where we haven't actually sent any map/unmap commands?

Thanks,
Robin.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map
On 2023-09-18 12:51, Niklas Schnelle wrote:

Pull out the sync operation from viommu_map_pages() by implementing ops->iotlb_sync_map. This allows the common IOMMU code to map multiple elements of an sg with a single sync (see iommu_map_sg()). Furthermore, it is also a requirement for IOMMU_CAP_DEFERRED_FLUSH.

Is it really a requirement? Deferred flush only deals with unmapping. Or are you just trying to say that it's not too worthwhile to try doing more for unmapping performance while obvious mapping performance is still left on the table?

Link: https://lore.kernel.org/lkml/20230726111433.1105665-1-schne...@linux.ibm.com/
Signed-off-by: Niklas Schnelle
---
 drivers/iommu/virtio-iommu.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 17dcd826f5c2..3649586f0e5c 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -189,6 +189,12 @@ static int viommu_sync_req(struct viommu_dev *viommu)
 	int ret;
 	unsigned long flags;
 
+	/*
+	 * .iotlb_sync_map and .flush_iotlb_all may be called before the viommu
+	 * is initialized e.g. via iommu_create_device_direct_mappings()
+	 */
+	if (!viommu)
+		return 0;

Minor nit: I'd be inclined to make that check explicit in the places where it definitely is expected, rather than allowing *any* sync to silently do nothing if called incorrectly. Plus then they could use vdomain->nr_endpoints for consistency with the equivalent checks elsewhere (it did take me a moment to figure out how we could get to .iotlb_sync_map with a NULL viommu without viommu_map_pages() blowing up first...)

Thanks,
Robin.
 	spin_lock_irqsave(&viommu->request_lock, flags);
 	ret = __viommu_sync_req(viommu);
 	if (ret)
@@ -843,7 +849,7 @@ static int viommu_map_pages(struct iommu_domain *domain, unsigned long iova,
 		.flags		= cpu_to_le32(flags),
 	};
 
-	ret = viommu_send_req_sync(vdomain->viommu, &map, sizeof(map));
+	ret = viommu_add_req(vdomain->viommu, &map, sizeof(map));
 	if (ret) {
 		viommu_del_mappings(vdomain, iova, end);
 		return ret;
@@ -912,6 +918,14 @@ static void viommu_iotlb_sync(struct iommu_domain *domain,
 	viommu_sync_req(vdomain->viommu);
 }
 
+static int viommu_iotlb_sync_map(struct iommu_domain *domain,
+				 unsigned long iova, size_t size)
+{
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+	return viommu_sync_req(vdomain->viommu);
+}
+
 static void viommu_get_resv_regions(struct device *dev, struct list_head *head)
 {
 	struct iommu_resv_region *entry, *new_entry, *msi = NULL;
@@ -1058,6 +1072,7 @@ static struct iommu_ops viommu_ops = {
 		.unmap_pages	= viommu_unmap_pages,
 		.iova_to_phys	= viommu_iova_to_phys,
 		.iotlb_sync	= viommu_iotlb_sync,
+		.iotlb_sync_map	= viommu_iotlb_sync_map,
 		.free		= viommu_domain_free,
 	}
};
Re: [PATCH 2/2] iommu/virtio: Add ops->flush_iotlb_all and enable deferred flush
On 2023-09-04 16:34, Jean-Philippe Brucker wrote:
On Fri, Aug 25, 2023 at 05:21:26PM +0200, Niklas Schnelle wrote:

Add ops->flush_iotlb_all operation to enable virtio-iommu for the dma-iommu deferred flush scheme. This results in a significant increase in performance in exchange for a window in which devices can still access previously IOMMU-mapped memory. To get back to the prior behavior iommu.strict=1 may be set on the kernel command line.

Maybe add that it depends on CONFIG_IOMMU_DEFAULT_DMA_{LAZY,STRICT} as well, because I've seen kernel configs that enable either.

Indeed, I'd be inclined to phrase it in terms of the driver now actually being able to honour lazy mode when requested (which happens to be the default on x86), rather than as if it might be some potentially-unexpected change in behaviour.

Thanks,
Robin.

Link: https://lore.kernel.org/lkml/20230802123612.GA6142@myrica/
Signed-off-by: Niklas Schnelle
---
 drivers/iommu/virtio-iommu.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index fb73dec5b953..1b7526494490 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -924,6 +924,15 @@ static int viommu_iotlb_sync_map(struct iommu_domain *domain,
 	return viommu_sync_req(vdomain->viommu);
 }
 
+static void viommu_flush_iotlb_all(struct iommu_domain *domain)
+{
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+	if (!vdomain->nr_endpoints)
+		return;

As for patch 1, a NULL check in viommu_sync_req() would allow dropping this one

Thanks,
Jean

+	viommu_sync_req(vdomain->viommu);
+}
+
 static void viommu_get_resv_regions(struct device *dev, struct list_head *head)
 {
 	struct iommu_resv_region *entry, *new_entry, *msi = NULL;
@@ -1049,6 +1058,8 @@ static bool viommu_capable(struct device *dev, enum iommu_cap cap)
 	switch (cap) {
 	case IOMMU_CAP_CACHE_COHERENCY:
 		return true;
+	case IOMMU_CAP_DEFERRED_FLUSH:
+		return true;
 	default:
 		return false;
 	}
@@ -1069,6 +1080,7 @@
 static struct iommu_ops viommu_ops = {
 		.map_pages	= viommu_map_pages,
 		.unmap_pages	= viommu_unmap_pages,
 		.iova_to_phys	= viommu_iova_to_phys,
+		.flush_iotlb_all	= viommu_flush_iotlb_all,
 		.iotlb_sync	= viommu_iotlb_sync,
 		.iotlb_sync_map	= viommu_iotlb_sync_map,
 		.free		= viommu_domain_free,
--
2.39.2
Re: [PATCH] iommu: Explicitly include correct DT includes
On 14/07/2023 6:46 pm, Rob Herring wrote:

The DT of_device.h and of_platform.h date back to the separate of_platform_bus_type before it was merged into the regular platform bus. As part of that merge prepping Arm DT support 13 years ago, they "temporarily" include each other. They also include platform_device.h and of.h. As a result, there's a pretty much random mix of those include files used throughout the tree. In order to detangle these headers and replace the implicit includes with struct declarations, users need to explicitly include the correct includes.

Thanks Rob; FWIW,

Acked-by: Robin Murphy

I guess you're hoping for Joerg to pick this up? However I wouldn't foresee any major conflicts if you do need to take it through the OF tree.

Cheers,
Robin.

Signed-off-by: Rob Herring
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c | 2 +-
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 1 -
 drivers/iommu/arm/arm-smmu/qcom_iommu.c | 3 +--
 drivers/iommu/ipmmu-vmsa.c | 1 -
 drivers/iommu/sprd-iommu.c | 1 +
 drivers/iommu/tegra-smmu.c | 2 +-
 drivers/iommu/virtio-iommu.c | 2 +-
 7 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c
index b5b14108e086..bb89d49adf8d 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c
@@ -3,7 +3,7 @@
 * Copyright (c) 2022 Qualcomm Innovation Center, Inc. All rights reserved.
 */
-#include +#include #include #include

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index a86acd76c1df..d6d1a2a55cc0 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -29,7 +29,6 @@
 #include #include #include -#include #include #include #include

diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index a503ed758ec3..cc3f68a3516c 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -22,8 +22,7 @@
 #include #include #include -#include -#include +#include #include #include #include

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 9f64c5c9f5b9..0aeedd3e1494 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -17,7 +17,6 @@
 #include #include #include -#include #include #include #include

diff --git a/drivers/iommu/sprd-iommu.c b/drivers/iommu/sprd-iommu.c
index 39e34fdeccda..51144c232474 100644
--- a/drivers/iommu/sprd-iommu.c
+++ b/drivers/iommu/sprd-iommu.c
@@ -14,6 +14,7 @@
 #include #include #include +#include #include #include

diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index 1cbf063ccf14..e445f80d0226 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -9,7 +9,7 @@
 #include #include #include -#include +#include #include #include #include

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 3551ed057774..17dcd826f5c2 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -13,7 +13,7 @@
 #include #include #include -#include +#include #include #include #include

[Note: the header names inside the #include directives were lost in archiving.]
Re: [PATCH v2 04/10] iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()
On 2023-01-18 18:00, Jason Gunthorpe wrote:

Change the sg_alloc_table_from_pages() allocation that was hardwired to GFP_KERNEL to use the gfp parameter like the other allocations in this function. Auditing says this is never called from an atomic context, so it is safe as is, but reads wrong.

I think the point may have been that the sgtable metadata is a logically-distinct allocation from the buffer pages themselves. Much like the allocation of the pages array itself further down in __iommu_dma_alloc_pages(). I see these days it wouldn't be catastrophic to pass GFP_HIGHMEM into __get_free_page() via sg_kmalloc(), but still, allocating implementation-internal metadata with all the same constraints as a DMA buffer has just as much smell of wrong about it IMO.

I'd say the more confusing thing about this particular context is why we're using iommu_map_sg_atomic() further down - that seems to have been an oversight in 781ca2de89ba, since this particular path has never supported being called in atomic context.

Overall I'm starting to wonder if it might not be better to stick a "use GFP_KERNEL_ACCOUNT if you allocate" flag in the domain for any level of the API internals to pick up as appropriate, rather than propagate per-call gfp flags everywhere. As it stands we're still missing potential pagetable and other domain-related allocations by drivers in .attach_dev and even (in probably-shouldn't-really-happen cases) .unmap_pages...

Thanks,
Robin.
Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/dma-iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8c2788633c1766..e4bf1bb159f7c7 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -822,7 +822,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
 	if (!iova)
 		goto out_free_pages;
 
-	if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, GFP_KERNEL))
+	if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, gfp))
 		goto out_free_iova;
 
 	if (!(ioprot & IOMMU_CACHE)) {
Re: [PATCH 1/8] iommu: Add a gfp parameter to iommu_map()
On 2023-01-06 16:42, Jason Gunthorpe wrote:

The internal mechanisms support this, but instead of exposing the gfp to the caller it wraps it into iommu_map() and iommu_map_atomic(). Fix this instead of adding more variants for GFP_KERNEL_ACCOUNT.

FWIW, since we *do* have two variants already, I think I'd have a mild preference for leaving the regular map calls as-is (i.e. implicit GFP_KERNEL), and just generalising the _atomic versions for the special cases. However, echoing the recent activity over on the DMA API side of things, I think it's still worth proactively constraining the set of permissible flags, lest we end up with more weird problems if stuff that doesn't really make sense, like GFP_COMP or zone flags, manages to leak through (that may have been part of the reason for having the current wrappers rather than a bare gfp argument in the first place, I forget now).

Thanks,
Robin.

Signed-off-by: Jason Gunthorpe
---
 arch/arm/mm/dma-mapping.c | 11 +++
 .../gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c | 3 ++-
 drivers/gpu/drm/tegra/drm.c | 2 +-
 drivers/gpu/host1x/cdma.c | 2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c | 4 ++--
 drivers/iommu/dma-iommu.c | 2 +-
 drivers/iommu/iommu.c | 17 ++---
 drivers/iommu/iommufd/pages.c | 6 --
 drivers/media/platform/qcom/venus/firmware.c | 2 +-
 drivers/net/ipa/ipa_mem.c | 6 --
 drivers/net/wireless/ath/ath10k/snoc.c | 2 +-
 drivers/net/wireless/ath/ath11k/ahb.c | 4 ++--
 drivers/remoteproc/remoteproc_core.c | 5 +++--
 drivers/vfio/vfio_iommu_type1.c | 9 +
 drivers/vhost/vdpa.c | 2 +-
 include/linux/iommu.h | 4 ++--
 16 files changed, 43 insertions(+), 38 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index c135f6e37a00ca..8bc01071474ab7 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -984,7 +984,8 @@ __iommu_create_mapping(struct device *dev, struct page **pages, size_t size,
 		len = (j - i) << PAGE_SHIFT;
 		ret = iommu_map(mapping->domain, iova, phys, len,
-				__dma_info_to_prot(DMA_BIDIRECTIONAL, attrs));
+				__dma_info_to_prot(DMA_BIDIRECTIONAL, attrs),
+				GFP_KERNEL);
 		if (ret < 0)
 			goto fail;
 		iova += len;
@@ -1207,7 +1208,8 @@ static int __map_sg_chunk(struct device *dev, struct scatterlist *sg,
 	prot = __dma_info_to_prot(dir, attrs);
 
-	ret = iommu_map(mapping->domain, iova, phys, len, prot);
+	ret = iommu_map(mapping->domain, iova, phys, len, prot,
+			GFP_KERNEL);
 	if (ret < 0)
 		goto fail;
 	count += len >> PAGE_SHIFT;
@@ -1379,7 +1381,8 @@ static dma_addr_t arm_iommu_map_page(struct device *dev, struct page *page,
 	prot = __dma_info_to_prot(dir, attrs);
 
-	ret = iommu_map(mapping->domain, dma_addr, page_to_phys(page), len, prot);
+	ret = iommu_map(mapping->domain, dma_addr, page_to_phys(page), len,
+			prot, GFP_KERNEL);
 	if (ret < 0)
 		goto fail;
@@ -1443,7 +1446,7 @@ static dma_addr_t arm_iommu_map_resource(struct device *dev,
 	prot = __dma_info_to_prot(dir, attrs) | IOMMU_MMIO;
 
-	ret = iommu_map(mapping->domain, dma_addr, addr, len, prot);
+	ret = iommu_map(mapping->domain, dma_addr, addr, len, prot, GFP_KERNEL);
 	if (ret < 0)
 		goto fail;

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c b/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
index 648ecf5a8fbc2a..a4ac94a2ab57fc 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
@@ -475,7 +475,8 @@ gk20a_instobj_ctor_iommu(struct gk20a_instmem *imem, u32 npages, u32 align,
 		u32 offset = (r->offset + i) << imem->iommu_pgshift;
 		ret = iommu_map(imem->domain, offset, node->dma_addrs[i],
-				PAGE_SIZE, IOMMU_READ | IOMMU_WRITE);
+				PAGE_SIZE, IOMMU_READ | IOMMU_WRITE,
+				GFP_KERNEL);
 		if (ret < 0) {
 			nvkm_error(subdev, "IOMMU mapping failure: %d\n", ret);

diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index 7bd2e65c2a16c5..6ca9f396e55be4 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -1057,7 +1057,7 @@ void *tegra_drm_alloc(struct tegra_drm *tegra, size_t size, dma_addr_t *dma)
 	*dma = iova_dma_addr(&tegra->carveout.domain, alloc);
 
 	err =
Re: The arm smmu driver for Linux does not support debugfs
On 2022-11-15 02:28, leo-...@hotmail.com wrote:

Hi, why doesn't the arm smmu driver for Linux support debugfs?

Because nobody's ever written any debugfs code for it.

Are there any historical reasons?

Only that so far nobody's needed to. TBH, arm-smmu is actually quite straightforward, and none of the internal driver state is really all that interesting (other than the special private Adreno stuff, but we leave it up to Rob to implement whatever he needs there). Given the kernel config, module parameters, and the features logged at probe, you can already infer how it will set up context banks etc. for regular IOMMU API work; there won't be any surprises. At this point there shouldn't be any need to debug the driver itself, it's mature and stable.

For debugging *users* of the driver, I've only dealt with the DMA layer, where a combination of the IOMMU API tracepoints, CONFIG_DMA_API_DEBUG, and my own hacks to iommu-dma have always proved sufficient to get enough insight into what's being mapped where.

I think a couple of people have previously raised the idea of implementing some kind of debugfs dumping for io-pgtable, but nothing's ever come of it. As above, it often turns out that you can find the information you need from other existing sources, thus the effort of implementing and maintaining a load of special-purpose debug code can be saved. In particular it would not be worth having driver-specific code that only helps debug generic IOMMU API usage - that would be much better implemented at the generic IOMMU API level.

Thanks,
Robin.
Re: [PATCH 4/5] iommu: Regulate errno in ->attach_dev callback functions
On 2022-09-14 18:58, Nicolin Chen wrote:
On Wed, Sep 14, 2022 at 10:49:42AM +0100, Jean-Philippe Brucker wrote:
On Wed, Sep 14, 2022 at 06:11:06AM -0300, Jason Gunthorpe wrote:
On Tue, Sep 13, 2022 at 01:27:03PM +0100, Jean-Philippe Brucker wrote:

I think in the future it will be too easy to forget about the constrained return value of attach() while modifying some other part of the driver, and let an external helper return EINVAL. So I'd rather not propagate ret from outside of viommu_domain_attach() and finalise().

Fortunately, if -EINVAL is wrongly returned it only creates an inefficiency, not a functional problem. So we do not need to be precise here.

Ah fair. In that case the attach_dev() documentation should indicate that EINVAL is a hint, so that callers don't rely on it (currently the words "must" and "exclusively" indicate that returning EINVAL for anything other than device-domain incompatibility is unacceptable). The virtio-iommu implementation may well return EINVAL from the virtio stack or from the host response.

How about this?

+ * * EINVAL	- mainly, device and domain are incompatible, or something went
+ *		  wrong with the domain. It's suggested to avoid kernel prints
+ *		  along with this errno. And it's better to convert any EINVAL
+ *		  returned from kAPIs to ENODEV if it is device-specific, or to
+ *		  some other reasonable errno being listed below

FWIW, I'd say something like: "The device and domain are incompatible. If this is due to some previous configuration of the domain, drivers should not log an error, since it is legitimate for callers to test reuse of an existing domain. Otherwise, it may still represent some fundamental problem."

And then at the public interfaces state it from the other angle: "The device and domain are incompatible. If the domain has already been used or configured in some way, attaching the same device to a different domain may be expected to succeed. 
Otherwise, it may still represent some fundamental problem."

[ and to save another mail, I'm not sure copying the default comment for ENOSPC is all that helpful either - what is "space" for something that isn't a storage device? I'd guess limited hardware resources in some form, but in the IOMMU context, potential confusion with address space is maybe a little too close for comfort? ]

Since we can't guarantee that APIs like virtio or ida won't ever return EINVAL, we should set all return values:

I dislike this a lot, it squashes all return codes to try to optimize an obscure failure path :(

Hmm... should I revert all the driver changes back to this version?

Yeah, I don't think we need to go too mad here. Drivers shouldn't emit their *own* -EINVAL unless appropriate, but if it comes back from some external API then that implies something's gone unexpectedly wrong anyway - maybe it's a transient condition and a subsequent different attach might actually work out OK? We can't really say in general. Besides, if the driver sees an error which implies it's done something wrong itself, it probably shouldn't be trusted to try to reason about it further. The caller can handle any error as long as we set their expectations correctly.

Thanks,
Robin.
Re: [PATCH v6 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group
On 2022-09-08 01:43, Jason Gunthorpe wrote: On Wed, Sep 07, 2022 at 08:41:13PM +0100, Robin Murphy wrote: FWIW, we're now very close to being able to validate dev->iommu against where the domain came from in core code, and so short-circuit ->attach_dev entirely if they don't match. I don't think this is a long term direction. We have systems now with a number of SMMU blocks and we really are going to see a need that they share the iommu_domains so we don't have unnecessary overheads from duplicated io page table memory. So ultimately I'd expect to pass the iommu_domain to the driver and the driver will decide if the page table memory it represents is compatible or not. Restricting to only the same iommu instance isn't good.. Who said IOMMU instance? Ah, I completely misunderstood what 'dev->iommu' was referring to, OK I see. Again, not what I was suggesting. In fact the nature of iommu_attach_group() already rules out bogus devices getting this far, so all a driver currently has to worry about is compatibility of a device that it definitely probed with a domain that it definitely allocated. Therefore, from a caller's point of view, if attaching to an existing domain returns -EINVAL, try another domain; multiple different existing domains can be tried, and may also return -EINVAL for the same or different reasons; the final attempt is to allocate a fresh domain and attach to that, which should always be nominally valid and *never* return -EINVAL. If any attempt returns any other error, bail out down the usual "this should have worked but something went wrong" path. Even if any driver did have a nonsensical "nothing went wrong, I just can't attach my device to any of my domains" case, I don't think it would really need distinguishing from any other general error anyway. The algorithm you described is exactly what this series does, it just used EMEDIUMTYPE instead of EINVAL. Changing it to EINVAL is not a fundamental problem, just a bit more work. 
Looking at Nicolin's series there is a bunch of existing errnos that would still need converting, ie EXDEV, EBUSY, EOPNOTSUPP, EFAULT, and ENXIO are all returned as codes for 'domain incompatible with device' in various drivers. So the patch would still look much the same, just changing them to EINVAL instead of EMEDIUMTYPE. That leaves the question of the remaining EINVALs that Nicolin did not convert to EMEDIUMTYPE. eg in the AMD driver:

	if (!check_device(dev))
		return -EINVAL;

	iommu = rlookup_amd_iommu(dev);
	if (!iommu)
		return -EINVAL;

These are all cases of 'something is really wrong with the device or iommu, everything will fail'. Other drivers are using ENODEV for this already, so we'd probably have an additional patch changing various places like that to ENODEV. This mixture of error codes is the basic reason why a new code was used, because none of the existing codes are used with any consistency. But OK, I'm on board, let's use more common errnos with specific meaning, that can be documented in a comment someplace:

ENOMEM - out of memory
ENODEV - no domain can attach, device or iommu is messed up
EINVAL - the domain is incompatible with the device
<other> - Same behavior as ENODEV, use is discouraged

I think achieving consistency of error codes is a generally desirable goal, it makes the error code actually useful. Joerg this is a good bit of work, will you be OK with it? Thus as long as we can maintain that basic guarantee that attaching a group to a newly allocated domain can only ever fail for resource allocation reasons and not some spurious "incompatibility", then we don't need any obscure trickery, and a single, clear, error code is in fact enough to say all that needs to be said. As above, this is not the case, drivers do seem to have error paths that are unconditional on the domain. Perhaps they are just protective assertions and never happen. 
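[ Editorial aside: the errno convention sketched in this mail can be illustrated with a small mock of a driver attach path. This is plain C with invented stand-in types (mock_iommu, mock_device, mock_domain, mock_attach_dev) purely for illustration - it is not kernel code and not any driver's actual logic; it only demonstrates the proposed split between ENODEV ("nothing will ever attach"), EINVAL ("this domain is incompatible, try another") and success. ]

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Invented stand-ins for illustration only - not the kernel's types. */
struct mock_iommu { int id; };
struct mock_device { struct mock_iommu *iommu; };
struct mock_domain { struct mock_iommu *owner; };

/*
 * Proposed convention:
 *   -ENODEV: device or iommu is messed up, no domain will ever attach
 *   -EINVAL: this domain is incompatible, but another one may work
 *        0 : attached
 */
static int mock_attach_dev(struct mock_domain *domain, struct mock_device *dev)
{
	if (!dev->iommu)
		return -ENODEV;	/* permanent: nothing will ever attach */

	if (domain->owner && domain->owner != dev->iommu)
		return -EINVAL;	/* incompatible: caller may try another domain */

	domain->owner = dev->iommu;	/* first attach "finalises" the domain */
	return 0;
}
```

A caller seeing -EINVAL here knows the failure is specific to the (domain, device) pairing, while -ENODEV means retrying with a different domain is pointless.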
Right, that's the gist of what I was getting at - I think it's worth putting in the effort to audit and fix the drivers so that that *can* be the case, then we can have a meaningful error API with standard codes effectively for free, rather than just sighing at the existing mess and building a slightly esoteric special case on top. Case in point, the AMD checks quoted above are pointless, since it checks the same things in ->probe_device, and if that fails then the device won't get a group so there's no way for it to even reach ->attach_dev any more. I'm sure there's a *lot* of cruft that can be cleared out now that per-device and per-domain ops give us this kind of inherent robustness. Cheers, Robin. Regardless, it doesn't matter. If they return ENODEV or EINVAL the VFIO side algorithm will continue to work fine, it just does a lot more work if EINVAL is permanently returned. Thanks, Jason
Re: [PATCH v6 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group
On 2022-09-07 18:00, Jason Gunthorpe wrote: On Wed, Sep 07, 2022 at 03:23:09PM +0100, Robin Murphy wrote: On 2022-09-07 14:47, Jason Gunthorpe wrote: On Wed, Sep 07, 2022 at 02:41:54PM +0200, Joerg Roedel wrote: On Mon, Aug 15, 2022 at 11:14:33AM -0700, Nicolin Chen wrote: Provide a dedicated errno from the IOMMU driver during attach that the reason attach failed is because of domain incompatibility. EMEDIUMTYPE is chosen because it is never used within the iommu subsystem today and evokes a sense that the 'medium' aka the domain is incompatible. I am not a fan of re-using EMEDIUMTYPE or any other special value. What is needed here is EINVAL, but with a way to tell the caller which of the function parameters is actually invalid. Using errnos to indicate the nature of failure is a well established unix practice, it is why we have hundreds of error codes and don't just return -EINVAL for everything. What don't you like about it? Would you be happier if we wrote it like #define IOMMU_EINCOMPATIBLE_DEVICE xx Which tells "which of the function parameters is actually invalid" ? FWIW, we're now very close to being able to validate dev->iommu against where the domain came from in core code, and so short-circuit ->attach_dev entirely if they don't match. I don't think this is a long term direction. We have systems now with a number of SMMU blocks and we really are going to see a need that they share the iommu_domains so we don't have unnecessary overheads from duplicated io page table memory. So ultimately I'd expect to pass the iommu_domain to the driver and the driver will decide if the page table memory it represents is compatible or not. Restricting to only the same iommu instance isn't good.. Who said IOMMU instance? As a reminder, the patch I currently have[1] is matching the driver (via the device ops), which happens to be entirely compatible with drivers supporting cross-instance domains. 
Mostly because we already have drivers that support cross-instance domains and callers that use them. At that point -EINVAL at the driver callback level could be assumed to refer to the domain argument, while anything else could be taken as something going unexpectedly wrong when the attach may otherwise have worked. I've forgotten if we actually had a valid case anywhere for "this is my device but even if you retry with a different domain it's still never going to work", but I think we wouldn't actually need that anyway - it should be clear enough to a caller that if attaching to an existing domain fails, then allocating a fresh domain and attaching also fails, that's the point to give up. The point was to have clear error handling, we either have permanent errors or 'this domain will never work with this device' errors. If we treat all errors as temporary and just retry randomly it can create a mess. For instance we might fail to attach to a perfectly compatible domain due to ENOMEM or something and then go on to successfully create a new 2nd domain, just due to races. We can certainly code the try everything then allocate scheme, it is just much more fragile than having definitive error codes. Again, not what I was suggesting. In fact the nature of iommu_attach_group() already rules out bogus devices getting this far, so all a driver currently has to worry about is compatibility of a device that it definitely probed with a domain that it definitely allocated. Therefore, from a caller's point of view, if attaching to an existing domain returns -EINVAL, try another domain; multiple different existing domains can be tried, and may also return -EINVAL for the same or different reasons; the final attempt is to allocate a fresh domain and attach to that, which should always be nominally valid and *never* return -EINVAL. If any attempt returns any other error, bail out down the usual "this should have worked but something went wrong" path. 
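[ Editorial aside: the caller-side retry algorithm described in this exchange can be sketched in plain C. Every name here (mock_domain, mock_group, attach_group, alloc_domain_for, find_or_alloc_domain) is an invented stand-in, not VFIO or IOMMU core code; the point is only the control flow keyed on -EINVAL. ]

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Invented mock: a domain is "compatible" when its id matches the group's. */
struct mock_domain { int id; };
struct mock_group { int id; };

static int attach_group(struct mock_domain *d, struct mock_group *g)
{
	return d->id == g->id ? 0 : -EINVAL;
}

static struct mock_domain *alloc_domain_for(struct mock_group *g)
{
	struct mock_domain *d = malloc(sizeof(*d));

	if (d)
		d->id = g->id;	/* a fresh domain is always nominally valid */
	return d;
}

/*
 * Caller's view: -EINVAL means "incompatible, try the next domain";
 * any other error means bail out down the "this should have worked
 * but something went wrong" path. The final, freshly allocated domain
 * must never be rejected with -EINVAL.
 */
static struct mock_domain *find_or_alloc_domain(struct mock_domain **existing,
						size_t n, struct mock_group *g)
{
	struct mock_domain *d;
	size_t i;

	for (i = 0; i < n; i++) {
		int ret = attach_group(existing[i], g);

		if (!ret)
			return existing[i];	/* reuse an existing domain */
		if (ret != -EINVAL)
			return NULL;		/* hard failure: give up */
		/* -EINVAL: incompatible, keep searching */
	}

	d = alloc_domain_for(g);
	if (d && !attach_group(d, g))
		return d;
	free(d);
	return NULL;
}
```

The fragility Jason raises shows up here too: if attach_group() could return -ENOMEM for a compatible domain, this loop would wrongly fall through and allocate a redundant second domain, which is exactly why the soft/hard error distinction matters.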
Even if any driver did have a nonsensical "nothing went wrong, I just can't attach my device to any of my domains" case, I don't think it would really need distinguishing from any other general error anyway. Once multiple drivers are in play, the only addition is that the "gatekeeper" check inside iommu_attach_group() may also return -EINVAL if the device is managed by a different driver, since that still fits the same "try again with a different domain" message to the caller. It's actually quite neat - basically the exact same thing we've tried to do with -EMEDIUMTYPE here, but more self-explanatory, since the fact is that a domain itself should never be invalid for attaching to via its own ops, and a group should never be inherently invalid for attaching to a suitable domain, it is only ever a particular combination of group (or device at the internal level) and domain that may not be valid together. Thus as long as we can maintain that basic guarantee that
Re: [PATCH] iommu/virtio: Fix compile error with viommu_capable()
On 2022-09-07 16:11, Joerg Roedel wrote: From: Joerg Roedel A recent fix introduced viommu_capable() but other changes from Robin change the function signature of the call-back it is used for. When both changes are merged a compile error will happen because the function pointer types mismatch. Fix that by updating the viommu_capable() signature after the merge. I thought I'd called out somewhere that this was going to be a conflict, but apparently not, sorry about that. Acked-by: Robin Murphy Lemme spin a patch for the outstanding LKP warning on the bus series before that gets annoying too... Cc: Jean-Philippe Brucker Cc: Robin Murphy Signed-off-by: Joerg Roedel --- drivers/iommu/virtio-iommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index da463db9f12a..1b12825e2df1 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -1005,7 +1005,7 @@ static int viommu_of_xlate(struct device *dev, struct of_phandle_args *args)
 	return iommu_fwspec_add_ids(dev, args->args, 1);
 }
 
-static bool viommu_capable(enum iommu_cap cap)
+static bool viommu_capable(struct device *dev, enum iommu_cap cap)
 {
 	switch (cap) {
 	case IOMMU_CAP_CACHE_COHERENCY:
Re: [PATCH v6 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group
On 2022-09-07 14:47, Jason Gunthorpe wrote: On Wed, Sep 07, 2022 at 02:41:54PM +0200, Joerg Roedel wrote: On Mon, Aug 15, 2022 at 11:14:33AM -0700, Nicolin Chen wrote: Provide a dedicated errno from the IOMMU driver during attach that the reason attach failed is because of domain incompatibility. EMEDIUMTYPE is chosen because it is never used within the iommu subsystem today and evokes a sense that the 'medium' aka the domain is incompatible. I am not a fan of re-using EMEDIUMTYPE or any other special value. What is needed here is EINVAL, but with a way to tell the caller which of the function parameters is actually invalid. Using errnos to indicate the nature of failure is a well established unix practice, it is why we have hundreds of error codes and don't just return -EINVAL for everything. What don't you like about it? Would you be happier if we wrote it like #define IOMMU_EINCOMPATIBLE_DEVICE xx Which tells "which of the function parameters is actually invalid" ? FWIW, we're now very close to being able to validate dev->iommu against where the domain came from in core code, and so short-circuit ->attach_dev entirely if they don't match. At that point -EINVAL at the driver callback level could be assumed to refer to the domain argument, while anything else could be taken as something going unexpectedly wrong when the attach may otherwise have worked. I've forgotten if we actually had a valid case anywhere for "this is my device but even if you retry with a different domain it's still never going to work", but I think we wouldn't actually need that anyway - it should be clear enough to a caller that if attaching to an existing domain fails, then allocating a fresh domain and attaching also fails, that's the point to give up. Robin.
Re: [PATCH v3] iommu/virtio: Fix interaction with VFIO
On 2022-08-25 16:46, Jean-Philippe Brucker wrote: Commit e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") requires IOMMU drivers to advertise IOMMU_CAP_CACHE_COHERENCY, in order to be used by VFIO. Since VFIO does not provide to userspace the ability to maintain coherency through cache invalidations, it requires hardware coherency. Advertise the capability in order to restore VFIO support. The meaning of IOMMU_CAP_CACHE_COHERENCY also changed from "IOMMU can enforce cache coherent DMA transactions" to "IOMMU_CACHE is supported". Argh! Massive apologies, I've been totally overlooking that detail and forgetting that we ended up splitting out the dedicated enforce_cache_coherency op... I do need reminding sometimes :) While virtio-iommu cannot enforce coherency (of PCIe no-snoop transactions), it does support IOMMU_CACHE. We can distinguish different cases of non-coherent DMA: (1) When accesses from a hardware endpoint are not coherent. The host would describe such a device using firmware methods ('dma-coherent' in device-tree, '_CCA' in ACPI), since they are also needed without a vIOMMU. In this case mappings are created without IOMMU_CACHE. virtio-iommu doesn't need any additional support. It sends the same requests as for coherent devices. (2) When the physical IOMMU supports non-cacheable mappings. Supporting those would require a new feature in virtio-iommu, new PROBE request property and MAP flags. Device drivers would use a new API to discover this since it depends on the architecture and the physical IOMMU. (3) When the hardware supports PCIe no-snoop. It is possible for assigned PCIe devices to issue no-snoop transactions, and the virtio-iommu specification is lacking any mention of this. Arm platforms don't necessarily support no-snoop, and those that do cannot enforce coherency of no-snoop transactions. 
Device drivers must be careful about assuming that no-snoop transactions won't end up cached; see commit e02f5c1bb228 ("drm: disable uncached DMA optimization for ARM and arm64"). On x86 platforms, the host may or may not enforce coherency of no-snoop transactions with the physical IOMMU. But according to the above commit, on x86 a driver which assumes that no-snoop DMA is compatible with uncached CPU mappings will also work if the host enforces coherency. Although these issues are not specific to virtio-iommu, it could be used to facilitate discovery and configuration of no-snoop. This would require a new feature bit, PROBE property and ATTACH/MAP flags. Interpreted in the *correct* context, I do think this is objectively less wrong than before. We can't guarantee that the underlying implementation will respect cacheable mappings, but it is true that we can do everything in our power to ask for them. Reviewed-by: Robin Murphy Cc: sta...@vger.kernel.org Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") Signed-off-by: Jean-Philippe Brucker --- Since v2 [1], I tried to refine the commit message. This fix is needed for v5.19 and v6.0. I can improve the check once Robin's change [2] is merged: capable(IOMMU_CAP_CACHE_COHERENCY) could return dev->dma_coherent for case (1) above. 
[1] https://lore.kernel.org/linux-iommu/20220818163801.1011548-1-jean-phili...@linaro.org/ [2] https://lore.kernel.org/linux-iommu/d8bd8777d06929ad8f49df7fc80e1b9af32a41b5.1660574547.git.robin.mur...@arm.com/ --- drivers/iommu/virtio-iommu.c | 11 +++ 1 file changed, 11 insertions(+)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 08eeafc9529f..80151176ba12 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -1006,7 +1006,18 @@ static int viommu_of_xlate(struct device *dev, struct of_phandle_args *args)
 	return iommu_fwspec_add_ids(dev, args->args, 1);
 }
 
+static bool viommu_capable(enum iommu_cap cap)
+{
+	switch (cap) {
+	case IOMMU_CAP_CACHE_COHERENCY:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static struct iommu_ops viommu_ops = {
+	.capable	= viommu_capable,
 	.domain_alloc	= viommu_domain_alloc,
 	.probe_device	= viommu_probe_device,
 	.probe_finalize	= viommu_probe_finalize,
Re: [PATCH v2] iommu/virtio: Fix interaction with VFIO
On 2022-08-19 11:38, Jean-Philippe Brucker wrote: On Thu, Aug 18, 2022 at 09:10:25PM +0100, Robin Murphy wrote: On 2022-08-18 17:38, Jean-Philippe Brucker wrote: Commit e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") requires IOMMU drivers to advertise IOMMU_CAP_CACHE_COHERENCY, in order to be used by VFIO. Since VFIO does not provide to userspace the ability to maintain coherency through cache invalidations, it requires hardware coherency. Advertise the capability in order to restore VFIO support. The meaning of IOMMU_CAP_CACHE_COHERENCY also changed from "IOMMU can enforce cache coherent DMA transactions" to "IOMMU_CACHE is supported". While virtio-iommu cannot enforce coherency (of PCIe no-snoop transactions), it does support IOMMU_CACHE. Non-coherent accesses are not currently a concern for virtio-iommu because host OSes only assign coherent devices, Is that guaranteed though? I see nothing in VFIO checking *device* coherency, only that the *IOMMU* can impose it via this capability, which would form a very circular argument. Yes the wording is wrong here, more like "host OSes only assign devices whose accesses are coherent". And it's not guaranteed, just I'm still looking for a realistic counter-example. I guess a good indicator would be a VMM that presents a device without 'dma-coherent'. vfio-amba with the pl330 on Juno, perhaps? We can no longer say that in practice nobody has a VFIO-capable IOMMU in front of non-coherent PCI, now that Rockchip RK3588 boards are about to start shipping (at best we can only say that they still won't have the SMMUs in the DT until I've finished ripping up the bus ops). Ah, I was hoping that vfio-pci should only be concerned about no-snoop. Do you know if your series [2] ensures that the SMMU driver doesn't report IOMMU_CAP_CACHE_COHERENCY for that system? It should do, since the downstream DT says the SMMU is non-coherent. and the guest does not enable PCIe no-snoop. 
Nevertheless, we can summarize here the possible support for non-coherent DMA: (1) When accesses from a hardware endpoint are not coherent. The host would describe such a device using firmware methods ('dma-coherent' in device-tree, '_CCA' in ACPI), since they are also needed without a vIOMMU. In this case mappings are created without IOMMU_CACHE. virtio-iommu doesn't need any additional support. It sends the same requests as for coherent devices. (2) When the physical IOMMU supports non-cacheable mappings. Supporting those would require a new feature in virtio-iommu, new PROBE request property and MAP flags. Device drivers would use a new API to discover this since it depends on the architecture and the physical IOMMU. (3) When the hardware supports PCIe no-snoop. Some architectures do not support this either (whether no-snoop is supported by an Arm system is not discoverable by software). If Linux did enable no-snoop in endpoints on x86, then virtio-iommu would need additional features, PROBE property, ATTACH and/or MAP flags to support enforcing snoop. That's not an "if" - various Linux drivers *do* use no-snoop, which IIUC is the main reason for VFIO wanting to enforce this in the first place. For example, see the big fat comment in drm_arch_can_wc_memory() if you've forgotten the fun we had with AMD GPUs in the TX2 boxes back in the day ;) Ah duh, I missed that PCI_EXP_DEVCTL_NOSNOOP_EN defaults to 1, of course it does. So I think VFIO should clear it on Arm and make it read-only, since the SMMU can't force-snoop like on x86. I'd be tempted to do that if CONFIG_ARM{,64} is enabled, but checking a new IOMMU capability may be cleaner. I think that's a good idea, but IIRC Jason mentioned in review of the VFIO series that it's not sufficient to provide the actual guarantee we're after, since there are out-of-spec devices that ignore the control and may send no-snoop packets anyway. 
However, as part of a best-effort approach for arm64 it still makes sense to help all the well-behaved drivers/devices do the right thing. Cheers, Robin.
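[ Editorial aside: the best-effort mitigation discussed above - clearing the Enable No Snoop bit in the PCIe Device Control register - boils down to a one-bit mask. PCI_EXP_DEVCTL_NOSNOOP_EN (bit 11, 0x0800) is the real definition from Linux's pci_regs.h / the PCIe spec, but the helper below is an invented illustration operating on a raw register value, not actual VFIO code. ]

```c
#include <assert.h>
#include <stdint.h>

/* Enable No Snoop: bit 11 of the PCIe Device Control register. */
#define PCI_EXP_DEVCTL_NOSNOOP_EN 0x0800

/*
 * Best-effort sketch: mask out Enable No Snoop so that well-behaved
 * endpoints stop emitting no-snoop TLPs. Out-of-spec devices may
 * ignore the bit entirely, which is why this alone cannot provide a
 * coherency guarantee - it only helps the well-behaved cases.
 */
static uint16_t clear_nosnoop(uint16_t devctl)
{
	return devctl & (uint16_t)~PCI_EXP_DEVCTL_NOSNOOP_EN;
}
```

In a real implementation the value would be read from and written back to the device's PCIe capability (and the bit made read-only to the guest), which is the part that needs VFIO's involvement.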
Re: [PATCH v2] iommu/virtio: Fix interaction with VFIO
On 2022-08-18 17:38, Jean-Philippe Brucker wrote: Commit e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") requires IOMMU drivers to advertise IOMMU_CAP_CACHE_COHERENCY, in order to be used by VFIO. Since VFIO does not provide to userspace the ability to maintain coherency through cache invalidations, it requires hardware coherency. Advertise the capability in order to restore VFIO support. The meaning of IOMMU_CAP_CACHE_COHERENCY also changed from "IOMMU can enforce cache coherent DMA transactions" to "IOMMU_CACHE is supported". While virtio-iommu cannot enforce coherency (of PCIe no-snoop transactions), it does support IOMMU_CACHE. Non-coherent accesses are not currently a concern for virtio-iommu because host OSes only assign coherent devices, Is that guaranteed though? I see nothing in VFIO checking *device* coherency, only that the *IOMMU* can impose it via this capability, which would form a very circular argument. We can no longer say that in practice nobody has a VFIO-capable IOMMU in front of non-coherent PCI, now that Rockchip RK3588 boards are about to start shipping (at best we can only say that they still won't have the SMMUs in the DT until I've finished ripping up the bus ops). and the guest does not enable PCIe no-snoop. Nevertheless, we can summarize here the possible support for non-coherent DMA: (1) When accesses from a hardware endpoint are not coherent. The host would describe such a device using firmware methods ('dma-coherent' in device-tree, '_CCA' in ACPI), since they are also needed without a vIOMMU. In this case mappings are created without IOMMU_CACHE. virtio-iommu doesn't need any additional support. It sends the same requests as for coherent devices. (2) When the physical IOMMU supports non-cacheable mappings. Supporting those would require a new feature in virtio-iommu, new PROBE request property and MAP flags. 
Device drivers would use a new API to discover this since it depends on the architecture and the physical IOMMU. (3) When the hardware supports PCIe no-snoop. Some architectures do not support this either (whether no-snoop is supported by an Arm system is not discoverable by software). If Linux did enable no-snoop in endpoints on x86, then virtio-iommu would need additional features, PROBE property, ATTACH and/or MAP flags to support enforcing snoop. That's not an "if" - various Linux drivers *do* use no-snoop, which IIUC is the main reason for VFIO wanting to enforce this in the first place. For example, see the big fat comment in drm_arch_can_wc_memory() if you've forgotten the fun we had with AMD GPUs in the TX2 boxes back in the day ;) This is what I was getting at in reply to v1, it's really not a "this is fine as things stand" kind of patch, it's a "this is the best we can do to be less wrong for expected usage, but still definitely not right". Admittedly I downplayed that a little in [2] by deliberately avoiding all mention of no-snoop, but only because that's such a horrific unsolvable mess it's hardly worth the pain of bringing up... Cheers, Robin. Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") Signed-off-by: Jean-Philippe Brucker --- Since v1 [1], I added some details to the commit message. This fix is still needed for v5.19 and v6.0. I can improve the check once Robin's change [2] is merged: capable(IOMMU_CAP_CACHE_COHERENCY) could return dev->dma_coherent for case (1) above. 
[1] https://lore.kernel.org/linux-iommu/20220714111059.708735-1-jean-phili...@linaro.org/ [2] https://lore.kernel.org/linux-iommu/d8bd8777d06929ad8f49df7fc80e1b9af32a41b5.1660574547.git.robin.mur...@arm.com/ --- drivers/iommu/virtio-iommu.c | 11 +++ 1 file changed, 11 insertions(+)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 08eeafc9529f..80151176ba12 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -1006,7 +1006,18 @@ static int viommu_of_xlate(struct device *dev, struct of_phandle_args *args)
 	return iommu_fwspec_add_ids(dev, args->args, 1);
 }
 
+static bool viommu_capable(enum iommu_cap cap)
+{
+	switch (cap) {
+	case IOMMU_CAP_CACHE_COHERENCY:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static struct iommu_ops viommu_ops = {
+	.capable	= viommu_capable,
 	.domain_alloc	= viommu_domain_alloc,
 	.probe_device	= viommu_probe_device,
 	.probe_finalize	= viommu_probe_finalize,
Re: [PATCH] iommu/virtio: Advertise IOMMU_CAP_CACHE_COHERENCY
On 2022-07-14 14:00, Jean-Philippe Brucker wrote: On Thu, Jul 14, 2022 at 01:01:37PM +0100, Robin Murphy wrote: On 2022-07-14 12:11, Jean-Philippe Brucker wrote: Fix virtio-iommu interaction with VFIO, as VFIO now requires IOMMU_CAP_CACHE_COHERENCY. virtio-iommu does not support non-cacheable mappings, and always expects to be called with IOMMU_CACHE. Can we know this is actually true though? What if the virtio-iommu implementation is backed by something other than VFIO, and the underlying hardware isn't coherent? AFAICS the spec doesn't disallow that. Right, I should add a note about that. If someone does actually want to support non-coherent devices, I assume we'll add a per-device property, a 'non-cacheable' mapping flag, and IOMMU_CAP_CACHE_COHERENCY will hold. I'm also planning to add a check on (IOMMU_CACHE && !IOMMU_NOEXEC) in viommu_map(), but not as a fix. But what about all the I/O-coherent PL330s? :P (IIRC you can actually make a Juno do that with S2CR.MTCFG hacks...) In the meantime we do need to restore VFIO support under virtio-iommu, since userspace still expects that to work, and the existing use-cases are coherent devices. Yeah, I'm not necessarily against adding this as a horrible bodge for now - the reality is that people using VFIO must be doing it on coherent systems or it wouldn't be working properly anyway - as long as we all agree that that's what it is. Next cycle I'll be sending the follow-up patches to bring device_iommu_capable() to its final form (hoping the outstanding VDPA patch lands in the meantime), at which point we get to sort-of-fix the SMMU drivers[1], and can do something similar here too. I guess the main question for virtio-iommu is whether it needs to be described/negotiated in the protocol itself, or can be reliably described by other standard firmware properties (with maybe just a spec note to clarify that coherency must be consistent). Cheers, Robin. 
[1] https://gitlab.arm.com/linux-arm/linux-rm/-/commit/d8256bf48c8606cbaa6f0815696c2a6dbb72f1b0
Re: [PATCH] iommu/virtio: Advertise IOMMU_CAP_CACHE_COHERENCY
On 2022-07-14 12:11, Jean-Philippe Brucker wrote: Fix virtio-iommu interaction with VFIO, as VFIO now requires IOMMU_CAP_CACHE_COHERENCY. virtio-iommu does not support non-cacheable mappings, and always expects to be called with IOMMU_CACHE. Can we know this is actually true though? What if the virtio-iommu implementation is backed by something other than VFIO, and the underlying hardware isn't coherent? AFAICS the spec doesn't disallow that. Thanks, Robin. Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/virtio-iommu.c | 11 +++ 1 file changed, 11 insertions(+)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 25be4b822aa0..bf340d779c10 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -1006,7 +1006,18 @@ static int viommu_of_xlate(struct device *dev, struct of_phandle_args *args)
 	return iommu_fwspec_add_ids(dev, args->args, 1);
 }
 
+static bool viommu_capable(enum iommu_cap cap)
+{
+	switch (cap) {
+	case IOMMU_CAP_CACHE_COHERENCY:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static struct iommu_ops viommu_ops = {
+	.capable	= viommu_capable,
 	.domain_alloc	= viommu_domain_alloc,
 	.probe_device	= viommu_probe_device,
 	.probe_finalize	= viommu_probe_finalize,
Re: [PATCH] vdpa: Use device_iommu_capable()
On 2022-06-08 12:48, Robin Murphy wrote: Use the new interface to check the capability for our device specifically. Just checking in case this got lost - vdpa is now the only remaining iommu_capable() user in linux-next, and I'd like to be able to remove the old interface next cycle. Thanks, Robin. Signed-off-by: Robin Murphy --- drivers/vhost/vdpa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 935a1d0ddb97..4cfebcc24a03 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1074,7 +1074,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
 	if (!bus)
 		return -EFAULT;
 
-	if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
+	if (!device_iommu_capable(dma_dev, IOMMU_CAP_CACHE_COHERENCY))
 		return -ENOTSUPP;
 
 	v->domain = iommu_domain_alloc(bus);
Re: [PATCH v4 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group
On 01/07/2022 5:43 pm, Nicolin Chen wrote: On Fri, Jul 01, 2022 at 11:21:48AM +0100, Robin Murphy wrote:

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 2ed3594f384e..072cac5ab5a4 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1135,10 +1135,8 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	struct arm_smmu_device *smmu;
 	int ret;
 
-	if (!fwspec || fwspec->ops != &arm_smmu_ops) {
-		dev_err(dev, "cannot attach to SMMU, is it on the same bus?\n");
-		return -ENXIO;
-	}
+	if (!fwspec || fwspec->ops != &arm_smmu_ops)
+		return -EMEDIUMTYPE;

This is the wrong check, you want the "if (smmu_domain->smmu != smmu)" condition further down. If this one fails it's effectively because the device doesn't have an IOMMU at all, and similar to patch #3 it will be Thanks for the review! I will fix that. The "on the same bus" is quite eye-catching. removed once the core code takes over properly (I even have both those patches written now!) Actually in my v1 the proposal for ops check returned -EMEDIUMTYPE also upon an ops mismatch, treating that too as an incompatibility. Do you mean that we should have fine-grained it further? On second look, I think this particular check was already entirely redundant by the time I made the fwspec conversion to it, oh well. Since it remains harmless for the time being, let's just ignore it entirely until we can confidently say goodbye to the whole lot[1]. I don't think there's any need to differentiate an instance mismatch from a driver mismatch, once the latter becomes realistically possible, mostly due to iommu_domain_alloc() also having to become device-aware to know which driver to allocate from. 
Thus as far as a user is concerned, if attaching a device to an existing domain fails with -EMEDIUMTYPE, allocating a new domain using the given device, and attaching to that, can be expected to succeed, regardless of why the original attempt was rejected. In fact even in the theoretical different-driver-per-bus model the same principle still holds up. Thanks, Robin. [1] https://gitlab.arm.com/linux-arm/linux-rm/-/commit/aa4accfa4a10e92daad0d51095918e8a89014393
Re: [PATCH v4 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group
On 2022-06-30 21:36, Nicolin Chen wrote: Cases like VFIO wish to attach a device to an existing domain that was not allocated specifically from the device. This raises a condition where the IOMMU driver can fail the domain attach because the domain and device are incompatible with each other. This is a soft failure that can be resolved by using a different domain. Provide a dedicated errno from the IOMMU driver during attach to indicate that the attach failed because of domain incompatibility. EMEDIUMTYPE is chosen because it is never used within the iommu subsystem today and evokes a sense that the 'medium' aka the domain is incompatible. VFIO can use this to know the attach is a soft failure and it should continue searching. Otherwise the attach will be a hard failure and VFIO will return the code to userspace. Update all drivers to return EMEDIUMTYPE in their failure paths that are related to domain incompatibility. Also remove adjacent error prints for these soft failures, to prevent kernel log spam, since -EMEDIUMTYPE is clear enough to indicate an incompatibility error. Add kdocs describing this behavior. Suggested-by: Jason Gunthorpe Reviewed-by: Kevin Tian Signed-off-by: Nicolin Chen --- [...] diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c index 2ed3594f384e..072cac5ab5a4 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c @@ -1135,10 +1135,8 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev) struct arm_smmu_device *smmu; int ret; - if (!fwspec || fwspec->ops != &arm_smmu_ops) { - dev_err(dev, "cannot attach to SMMU, is it on the same bus?\n"); - return -ENXIO; - } + if (!fwspec || fwspec->ops != &arm_smmu_ops) + return -EMEDIUMTYPE; This is the wrong check, you want the "if (smmu_domain->smmu != smmu)" condition further down.
If this one fails it's effectively because the device doesn't have an IOMMU at all, and similar to patch #3 it will be removed once the core code takes over properly (I even have both those patches written now!) Thanks, Robin. /* * FIXME: The arch/arm DMA API code tries to attach devices to its own
Re: [PATCH v3 1/5] iommu: Return -EMEDIUMTYPE for incompatible domain and device/group
On 2022-06-29 20:47, Nicolin Chen wrote: On Fri, Jun 24, 2022 at 03:19:43PM -0300, Jason Gunthorpe wrote: On Fri, Jun 24, 2022 at 06:35:49PM +0800, Yong Wu wrote: It's not used in VFIO context. "return 0" just satisfies the iommu framework to go ahead. And yes, here we only allow the shared "mapping-domain" (all the devices share a domain created internally). What part of the iommu framework is trying to attach a domain and wants to see success when the domain was not actually attached? What prevents this driver from being used in VFIO context? Nothing prevents this; I just didn't test it. This is why it is wrong to return success here. Hi Yong, would you or someone you know be able to confirm whether this "return 0" is still a must or not? From memory, it is unfortunately required, due to this driver being in the rare position of having to support multiple devices in a single address space on 32-bit ARM. Since the old ARM DMA code doesn't understand groups, the driver sets up its own canonical dma_iommu_mapping to act like a default domain, but then has to politely say "yeah OK" to arm_setup_iommu_dma_ops() for each device so that they do all end up with the right DMA ops rather than dying in screaming failure (the ARM code's per-device mappings then get leaked, but we can't really do any better). The whole mess disappears in the proper default domain conversion, but in the meantime, it's still safe to assume that nobody's doing VFIO with embedded display/video codec/etc. blocks that don't even have reset drivers. Thanks, Robin.
Re: [PATCH v6 00/22] Add generic memory shrinker to VirtIO-GPU and Panfrost DRM drivers
On 2022-05-27 00:50, Dmitry Osipenko wrote: Hello, This patchset introduces a memory shrinker for the VirtIO-GPU DRM driver and adds memory purging and eviction support to the VirtIO-GPU driver. The new dma-buf locking convention is introduced here as well. During OOM, the shrinker will release BOs that are marked as "not needed" by userspace using the new madvise IOCTL; it will also evict idle BOs to swap. The userspace in this case is the Mesa VirGL driver: it will mark cached BOs as "not needed", allowing the kernel driver to release the memory of cached shmem BOs in low-memory situations, preventing OOM kills. The Panfrost driver is switched to use the generic memory shrinker. I think we still have some outstanding issues here - Alyssa reported some weirdness yesterday, so I just tried provoking a low-memory condition locally with this series applied and a few debug options enabled, and the results as below were... interesting. Thanks, Robin. ->8- [ 68.295951] == [ 68.295956] WARNING: possible circular locking dependency detected [ 68.295963] 5.19.0-rc3+ #400 Not tainted [ 68.295972] -- [ 68.295977] cc1/295 is trying to acquire lock: [ 68.295986] 08d7f1a0 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_gem_shmem_free+0x7c/0x198 [ 68.296036] [ 68.296036] but task is already holding lock: [ 68.296041] 8c14b820 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x4d8/0x1470 [ 68.296080] [ 68.296080] which lock already depends on the new lock.
[ 68.296080] [ 68.296085] [ 68.296085] the existing dependency chain (in reverse order) is: [ 68.296090] [ 68.296090] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 68.296111]fs_reclaim_acquire+0xb8/0x150 [ 68.296130]dma_resv_lockdep+0x298/0x3fc [ 68.296148]do_one_initcall+0xe4/0x5f8 [ 68.296163]kernel_init_freeable+0x414/0x49c [ 68.296180]kernel_init+0x2c/0x148 [ 68.296195]ret_from_fork+0x10/0x20 [ 68.296207] [ 68.296207] -> #0 (reservation_ww_class_mutex){+.+.}-{3:3}: [ 68.296229]__lock_acquire+0x1724/0x2398 [ 68.296246]lock_acquire+0x218/0x5b0 [ 68.296260]__ww_mutex_lock.constprop.0+0x158/0x2378 [ 68.296277]ww_mutex_lock+0x7c/0x4d8 [ 68.296291]drm_gem_shmem_free+0x7c/0x198 [ 68.296304]panfrost_gem_free_object+0x118/0x138 [ 68.296318]drm_gem_object_free+0x40/0x68 [ 68.296334]drm_gem_shmem_shrinker_run_objects_scan+0x42c/0x5b8 [ 68.296352]drm_gem_shmem_shrinker_scan_objects+0xa4/0x170 [ 68.296368]do_shrink_slab+0x220/0x808 [ 68.296381]shrink_slab+0x11c/0x408 [ 68.296392]shrink_node+0x6ac/0xb90 [ 68.296403]do_try_to_free_pages+0x1dc/0x8d0 [ 68.296416]try_to_free_pages+0x1ec/0x5b0 [ 68.296429]__alloc_pages_slowpath.constprop.0+0x528/0x1470 [ 68.296444]__alloc_pages+0x4e0/0x5b8 [ 68.296455]__folio_alloc+0x24/0x60 [ 68.296467]vma_alloc_folio+0xb8/0x2f8 [ 68.296483]alloc_zeroed_user_highpage_movable+0x58/0x68 [ 68.296498]__handle_mm_fault+0x918/0x12a8 [ 68.296513]handle_mm_fault+0x130/0x300 [ 68.296527]do_page_fault+0x1d0/0x568 [ 68.296539]do_translation_fault+0xa0/0xb8 [ 68.296551]do_mem_abort+0x68/0xf8 [ 68.296562]el0_da+0x74/0x100 [ 68.296572]el0t_64_sync_handler+0x68/0xc0 [ 68.296585]el0t_64_sync+0x18c/0x190 [ 68.296596] [ 68.296596] other info that might help us debug this: [ 68.296596] [ 68.296601] Possible unsafe locking scenario: [ 68.296601] [ 68.296604]CPU0CPU1 [ 68.296608] [ 68.296612] lock(fs_reclaim); [ 68.296622] lock(reservation_ww_class_mutex); [ 68.296633]lock(fs_reclaim); [ 68.296644] lock(reservation_ww_class_mutex); [ 68.296654] [ 68.296654] *** DEADLOCK *** [ 
68.296654] [ 68.296658] 3 locks held by cc1/295: [ 68.29] #0: 0616e898 (>mmap_lock){}-{3:3}, at: do_page_fault+0x144/0x568 [ 68.296702] #1: 8c14b820 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x4d8/0x1470 [ 68.296740] #2: 8c1215b0 (shrinker_rwsem){}-{3:3}, at: shrink_slab+0xc0/0x408 [ 68.296774] [ 68.296774] stack backtrace: [ 68.296780] CPU: 2 PID: 295 Comm: cc1 Not tainted 5.19.0-rc3+ #400 [ 68.296794] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Sep 3 2019 [ 68.296803] Call trace: [ 68.296808] dump_backtrace+0x1e4/0x1f0 [ 68.296821] show_stack+0x20/0x70 [ 68.296832] dump_stack_lvl+0x8c/0xb8 [ 68.296849] dump_stack+0x1c/0x38 [ 68.296864] print_circular_bug.isra.0+0x284/0x378 [ 68.296881] check_noncircular+0x1d8/0x1f8
Re: [PATCH v2 3/5] vfio/iommu_type1: Remove the domain->ops comparison
On 2022-06-24 14:16, Jason Gunthorpe wrote: On Wed, Jun 22, 2022 at 08:54:45AM +0100, Robin Murphy wrote: On 2022-06-16 23:23, Nicolin Chen wrote: On Thu, Jun 16, 2022 at 06:40:14AM +, Tian, Kevin wrote: The domain->ops validation was added, as a precaution, for mixed-driver systems. However, at this moment only one iommu driver is possible. So remove it. It's true on a physical platform. But I'm not sure whether a virtual platform is allowed to include multiple e.g. one virtio-iommu alongside a virtual VT-d or a virtual smmu. It might be clearer to claim that (as Robin pointed out) there are plenty of more significant problems than this to solve instead of simply saying that only one iommu driver is possible if we don't have explicit code to reject such configuration. Will edit this part. Thanks! Oh, physical platforms with mixed IOMMUs definitely exist already. The main point is that while bus_set_iommu still exists, the core code effectively *does* prevent multiple drivers from registering - even in emulated cases like the example above, virtio-iommu and VT-d would both try to bus_set_iommu(&pci_bus_type), and one of them will lose. The aspect which might warrant clarification is that there's no combination of supported drivers which claim non-overlapping buses *and* could appear in the same system - even if you tried to contrive something by emulating, say, VT-d (PCI) alongside rockchip-iommu (platform), you could still only describe one or the other due to ACPI vs. Devicetree. Right, and that is still something we need to protect against with this ops check. VFIO is not checking that the buses are the same before attempting to re-use a domain. So it is actually functional and does protect against systems with multiple iommu drivers on different busses. But as above, which systems *are* those?
Everything that's on my radar would have drivers all competing for the platform bus - Intel and s390 are somewhat the odd ones out in that respect, but are also non-issues as above. FWIW my iommu/bus dev branch has got as far as the final bus ops removal and allowing multiple driver registrations, and before it allows that, it does now have the common attach check that I sketched out in the previous discussion of this. It's probably also noteworthy that domain->ops is no longer the same domain->ops that this code was written to check, and may now be different between domains from the same driver. Thanks, Robin.
Re: [PATCH v2 3/5] vfio/iommu_type1: Remove the domain->ops comparison
On 2022-06-16 23:23, Nicolin Chen wrote: On Thu, Jun 16, 2022 at 06:40:14AM +, Tian, Kevin wrote: The domain->ops validation was added, as a precaution, for mixed-driver systems. However, at this moment only one iommu driver is possible. So remove it. It's true on a physical platform. But I'm not sure whether a virtual platform is allowed to include multiple e.g. one virtio-iommu alongside a virtual VT-d or a virtual smmu. It might be clearer to claim that (as Robin pointed out) there are plenty of more significant problems than this to solve instead of simply saying that only one iommu driver is possible if we don't have explicit code to reject such configuration. Will edit this part. Thanks! Oh, physical platforms with mixed IOMMUs definitely exist already. The main point is that while bus_set_iommu still exists, the core code effectively *does* prevent multiple drivers from registering - even in emulated cases like the example above, virtio-iommu and VT-d would both try to bus_set_iommu(&pci_bus_type), and one of them will lose. The aspect which might warrant clarification is that there's no combination of supported drivers which claim non-overlapping buses *and* could appear in the same system - even if you tried to contrive something by emulating, say, VT-d (PCI) alongside rockchip-iommu (platform), you could still only describe one or the other due to ACPI vs. Devicetree. Thanks, Robin.
Re: [PATCH v2 5/5] vfio/iommu_type1: Simplify group attachment
On 2022-06-17 03:53, Tian, Kevin wrote: From: Nicolin Chen Sent: Friday, June 17, 2022 6:41 AM ... - if (resv_msi) { + if (resv_msi && !domain->msi_cookie) { ret = iommu_get_msi_cookie(domain->domain, resv_msi_base); if (ret && ret != -ENODEV) goto out_detach; + domain->msi_cookie = true; } why not move this to alloc_attach_domain()? Then there would be no need for the new domain field. It's required only when a new domain is allocated. When reusing an existing domain that doesn't have an msi_cookie, we can do iommu_get_msi_cookie() if resv_msi is found. So it is not limited to a new domain. It looks like the msi_cookie requirement is per platform (currently only for SMMU; see arm_smmu_get_resv_regions()). If there is no mixed case then the above check is not required. But let's hear whether Robin has a different thought here. Yes, the cookie should logically be tied to the lifetime of the domain itself. In the relevant context, "an existing domain that doesn't have an msi_cookie" should never exist. Thanks, Robin.
[PATCH] vdpa: Use device_iommu_capable()
Use the new interface to check the capability for our device specifically.

Signed-off-by: Robin Murphy
---
 drivers/vhost/vdpa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 935a1d0ddb97..4cfebcc24a03 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1074,7 +1074,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
 	if (!bus)
 		return -EFAULT;

-	if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
+	if (!device_iommu_capable(dma_dev, IOMMU_CAP_CACHE_COHERENCY))
 		return -ENOTSUPP;

 	v->domain = iommu_domain_alloc(bus);
--
2.36.1.dirty
Re: [PATCH 2/5] iommu: Ensure device has the same iommu_ops as the domain
On 2022-06-06 17:51, Nicolin Chen wrote: Hi Robin, On Mon, Jun 06, 2022 at 03:33:42PM +0100, Robin Murphy wrote: On 2022-06-06 07:19, Nicolin Chen wrote: The core code should not call an iommu driver op with a struct device parameter unless it knows that the dev_iommu_priv_get() for that struct device was setup by the same driver. Otherwise in a mixed driver system the iommu_priv could be casted to the wrong type. We don't have mixed-driver systems, and there are plenty more significant problems than this one to solve before we can (but thanks for pointing it out - I hadn't got as far as auditing the public interfaces yet). Once domains are allocated via a particular device's IOMMU instance in the first place, there will be ample opportunity for the core to stash suitable identifying information in the domain for itself. TBH even the current code could do it without needing the weirdly invasive changes here. Do you have an alternative and less invasive solution in mind? Store the iommu_ops pointer in the iommu_domain and use it as a check to validate that the struct device is correct before invoking any domain op that accepts a struct device. In fact this even describes exactly that - "Store the iommu_ops pointer in the iommu_domain", vs. the "Store the iommu_ops pointer in the iommu_domain_ops" which the patch is actually doing :/ Will fix that. Well, as before I'd prefer to make the code match the commit message - if I really need to spell it out, see below - since I can't imagine that we should ever have need to identify a set of iommu_domain_ops in isolation, therefore I think it's considerably clearer to use the iommu_domain itself. However, either way we really don't need this yet, so we may as well just go ahead and remove the redundant test from VFIO anyway, and I can add some form of this patch to my dev branch for now. Thanks, Robin. 
->8-
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index cde2e1d6ab9b..72990edc9314 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1902,6 +1902,7 @@ static struct iommu_domain *__iommu_domain_alloc(struct device *dev,
 	domain->type = type;
 	/* Assume all sizes by default; the driver may override this later */
 	domain->pgsize_bitmap = ops->pgsize_bitmap;
+	domain->owner = ops;
 	if (!domain->ops)
 		domain->ops = ops->default_domain_ops;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 6f64cbbc6721..79e557207f53 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -89,6 +89,7 @@ struct iommu_domain_geometry {
 struct iommu_domain {
 	unsigned type;
+	const struct iommu_ops *owner; /* Who allocated this domain */
 	const struct iommu_domain_ops *ops;
 	unsigned long pgsize_bitmap;	/* Bitmap of page sizes in use */
 	iommu_fault_handler_t handler;
Re: [PATCH 2/5] iommu: Ensure device has the same iommu_ops as the domain
On 2022-06-06 07:19, Nicolin Chen wrote: The core code should not call an iommu driver op with a struct device parameter unless it knows that the dev_iommu_priv_get() for that struct device was setup by the same driver. Otherwise in a mixed driver system the iommu_priv could be casted to the wrong type. We don't have mixed-driver systems, and there are plenty more significant problems than this one to solve before we can (but thanks for pointing it out - I hadn't got as far as auditing the public interfaces yet). Once domains are allocated via a particular device's IOMMU instance in the first place, there will be ample opportunity for the core to stash suitable identifying information in the domain for itself. TBH even the current code could do it without needing the weirdly invasive changes here. Store the iommu_ops pointer in the iommu_domain and use it as a check to validate that the struct device is correct before invoking any domain op that accepts a struct device. In fact this even describes exactly that - "Store the iommu_ops pointer in the iommu_domain", vs. the "Store the iommu_ops pointer in the iommu_domain_ops" which the patch is actually doing :/ [...] diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 19cf28d40ebe..8a1f437a51f2 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1963,6 +1963,10 @@ static int __iommu_attach_device(struct iommu_domain *domain, { int ret; + /* Ensure the device was probe'd onto the same driver as the domain */ + if (dev->bus->iommu_ops != domain->ops->iommu_ops) Nope, dev_iommu_ops(dev) please. Furthermore I think the logical place to put this is in iommu_group_do_attach_device(), since that's the gateway for the public interfaces - we shouldn't need to second-guess ourselves for internal default-domain-related calls. Thanks, Robin. 
+ return -EMEDIUMTYPE; + if (unlikely(domain->ops->attach_dev == NULL)) return -ENODEV;
Re: [PATCH 1/5] iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()
On 2022-04-07 14:59, Jason Gunthorpe wrote: On Thu, Apr 07, 2022 at 07:18:48AM +, Tian, Kevin wrote: From: Jason Gunthorpe Sent: Thursday, April 7, 2022 1:17 AM On Wed, Apr 06, 2022 at 06:10:31PM +0200, Christoph Hellwig wrote: On Wed, Apr 06, 2022 at 01:06:23PM -0300, Jason Gunthorpe wrote: On Wed, Apr 06, 2022 at 05:50:56PM +0200, Christoph Hellwig wrote: On Wed, Apr 06, 2022 at 12:18:23PM -0300, Jason Gunthorpe wrote: Oh, I didn't know about device_get_dma_attr().. Which is completely broken for any non-OF, non-ACPI plaform. I saw that, but I spent some time searching and could not find an iommu driver that would load independently of OF or ACPI. ie no IOMMU platform drivers are created by board files. Things like Intel/AMD discover only from ACPI, etc. Intel discovers IOMMUs (and optionally ACPI namespace devices) from ACPI, but there is no ACPI description for PCI devices i.e. the current logic of device_get_dma_attr() cannot be used on PCI devices. Oh? So on x86 acpi_get_dma_attr() returns DEV_DMA_NON_COHERENT or DEV_DMA_NOT_SUPPORTED? I think it _should_ return DEV_DMA_COHERENT on x86/IA-64 (unless a _CCA method was actually present to say otherwise), based on acpi_init_coherency(), but I only know for sure what happens on arm64. I think I should give up on this and just redefine the existing iommu cap flag to IOMMU_CAP_CACHE_SUPPORTED or something. TBH I don't see any issue with current name, but I'd certainly be happy to nail down a specific definition for it, along the lines of "this means that IOMMU_CACHE mappings are generally coherent". That works for things like Arm's S2FWB making it OK to assign an otherwise-non-coherent device without extra hassle. For the specific case of overriding PCIe No Snoop (which is more problematic from an Arm SMMU PoV) when assigning to a VM, would that not be easier solved by just having vfio-pci clear the "Enable No Snoop" control bit in the endpoint's PCIe capability? 
We could alternatively use existing device_get_dma_attr() as a default with an iommu wrapper and push the exception down through the iommu driver and s390 can override it. if going this way probably device_get_dma_attr() should be renamed to device_fwnode_get_dma_attr() instead to make it clearer? I'm looking at the few users: drivers/ata/ahci_ceva.c drivers/ata/ahci_qoriq.c - These are ARM only drivers. They are trying to copy the dma-coherent property from its DT/ACPI definition to internal register settings which look like they tune how the AXI bus transactions are created. I'm guessing the SATA IP block's AXI interface can be configured to generate coherent or non-coherent requests and it has to be set in a way that is consistent with the SOC architecture and match what the DMA API expects the device will do. drivers/crypto/ccp/sp-platform.c - Only used on ARM64 and also programs a HW register similar to the sata drivers. Refuses to work if the FW property is not present. drivers/net/ethernet/amd/xgbe/xgbe-platform.c - Seems to be configuring another ARM AXI block drivers/gpu/drm/panfrost/panfrost_drv.c - Robin's commit comment here is good, and one of the things this controls is if the coherent_walk is set for the io-pgtable-arm.c code which avoids DMA API calls drivers/gpu/drm/tegra/uapi.c - Returns DRM_TEGRA_CHANNEL_CAP_CACHE_COHERENT to userspace. No idea. My take is that the drivers using this API are doing it to make sure their HW blocks are setup in a way that is consistent with the DMA API they are also using, and run in constrained embedded-style environments that know the firmware support is present. So in the end it does not seem suitable right now for linking to IOMMU_CACHE.. That seems a pretty good summary - I think they're basically all "firmware told Linux I'm coherent so I'd better act coherent" cases, but that still doesn't necessarily mean that they're *forced* to respect that. 
One of the things on my to-do list is to try adding a DMA_ATTR_NO_SNOOP that can force DMA cache maintenance for coherent devices, primarily to hook up in Panfrost (where there is a bit of performance to claw back on the coherent AmLogic SoCs by leaving certain buffers non-cacheable). Cheers, Robin.
Re: [PATCH 1/5] iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()
On 2022-04-06 15:14, Jason Gunthorpe wrote: On Wed, Apr 06, 2022 at 03:51:50PM +0200, Christoph Hellwig wrote: On Wed, Apr 06, 2022 at 09:07:30AM -0300, Jason Gunthorpe wrote: Didn't see it I'll move dev_is_dma_coherent to device.h along with device_iommu_mapped() and others then No. It is internal for a reason. It also doesn't actually work outside of the dma core. E.g. for non-swiotlb ARM configs it will not actually work. Really? It is the only condition that dma_info_to_prot() tests to decide if IOMMU_CACHE is used or not, so you are saying that there is a condition where a device can be attached to an iommu_domain and dev_is_dma_coherent() returns the wrong information? How does dma-iommu.c safely use it then? The common iommu-dma layer happens to be part of the subset of the DMA core which *does* play the dev->dma_coherent game. 32-bit Arm has its own IOMMU DMA ops which do not. I don't know if the set of PowerPCs with CONFIG_NOT_COHERENT_CACHE intersects the set of PowerPCs that can do VFIO, but that would be another example if so. In any case I still need to do something about the places checking IOMMU_CAP_CACHE_COHERENCY and thinking that means IOMMU_CACHE works. Any idea? Can we improve the IOMMU drivers such that that *can* be the case (within a reasonable margin of error)? That's kind of where I was hoping to head with device_iommu_capable(), e.g. [1]. Robin. [1] https://gitlab.arm.com/linux-arm/linux-rm/-/commit/53390e9505b3791adedc0974e251e5c7360e402e
Re: [PATCH 1/5] iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with dev_is_dma_coherent()
On 2022-04-05 17:16, Jason Gunthorpe wrote: vdpa and usnic are trying to test if IOMMU_CACHE is supported. The correct way to do this is via dev_is_dma_coherent() Not necessarily... Disregarding the complete disaster of PCIe No Snoop on Arm-Based systems, there's the more interesting effectively-opposite scenario where an SMMU bridges non-coherent devices to a coherent interconnect. It's not something we take advantage of yet in Linux, and it can only be properly described in ACPI, but there do exist situations where IOMMU_CACHE is capable of making the device's traffic snoop, but dev_is_dma_coherent() - and device_get_dma_attr() for external users - would still say non-coherent because they can't assume that the SMMU is enabled and programmed in just the right way. I've also not thought too much about how things might look with S2FWB thrown into the mix in future... Robin. like the DMA API does. If IOMMU_CACHE is not supported then these drivers won't work as they don't call any coherency-restoring routines around their DMAs. 
Signed-off-by: Jason Gunthorpe --- drivers/infiniband/hw/usnic/usnic_uiom.c | 16 +++- drivers/vhost/vdpa.c | 3 ++- 2 files changed, 9 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c index 760b254ba42d6b..24d118198ac756 100644 --- a/drivers/infiniband/hw/usnic/usnic_uiom.c +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c @@ -42,6 +42,7 @@ #include #include #include +#include #include "usnic_log.h" #include "usnic_uiom.h" @@ -474,6 +475,12 @@ int usnic_uiom_attach_dev_to_pd(struct usnic_uiom_pd *pd, struct device *dev) struct usnic_uiom_dev *uiom_dev; int err; + if (!dev_is_dma_coherent(dev)) { + usnic_err("IOMMU of %s does not support cache coherency\n", + dev_name(dev)); + return -EINVAL; + } + uiom_dev = kzalloc(sizeof(*uiom_dev), GFP_ATOMIC); if (!uiom_dev) return -ENOMEM; @@ -483,13 +490,6 @@ int usnic_uiom_attach_dev_to_pd(struct usnic_uiom_pd *pd, struct device *dev) if (err) goto out_free_dev; - if (!iommu_capable(dev->bus, IOMMU_CAP_CACHE_COHERENCY)) { - usnic_err("IOMMU of %s does not support cache coherency\n", - dev_name(dev)); - err = -EINVAL; - goto out_detach_device; - } - spin_lock(&pd->lock); list_add_tail(&uiom_dev->link, &pd->devs); pd->dev_cnt++; @@ -497,8 +497,6 @@ int usnic_uiom_attach_dev_to_pd(struct usnic_uiom_pd *pd, struct device *dev) return 0; -out_detach_device: - iommu_detach_device(pd->domain, dev); out_free_dev: kfree(uiom_dev); return err; diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 4c2f0bd062856a..05ea5800febc37 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -22,6 +22,7 @@ #include #include #include +#include #include "vhost.h" @@ -929,7 +930,7 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v) if (!bus) return -EFAULT; - if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY)) + if (!dev_is_dma_coherent(dma_dev)) return -ENOTSUPP; v->domain = iommu_domain_alloc(bus);
Re: [PATCH v1 1/2] drm/qxl: replace ioremap by ioremap_cache on arm64
On 2022-03-23 10:11, Gerd Hoffmann wrote: On Wed, Mar 23, 2022 at 09:45:13AM +, Robin Murphy wrote: On 2022-03-23 07:15, Christian König wrote: On 22.03.22 10:34, Cong Liu wrote: qxl uses ioremap to map ram_header and rom; in the arm64 implementation, the device is mapped as DEVICE_nGnRE, which cannot support unaligned access. Well, that some ARM boards don't allow unaligned access to MMIO space is a well-known bug of those ARM boards. So as far as I know this is a hardware bug you are trying to work around here and I'm not 100% sure that this is correct. No, this one's not a bug. The Device memory type used for iomem mappings is *architecturally* defined to enforce properties like aligned accesses, no speculation, no reordering, etc. If something wants to be treated more like RAM than actual MMIO registers, then ioremap_wc() or ioremap_cache() is the appropriate thing to do in general (with the former being a bit more portable according to Documentation/driver-api/device-io.rst). Well, qxl is a virtual device, so it *is* ram. I'm wondering whether qxl actually works on arm? As far I know all virtual display devices with (virtual) pci memory bars for vram do not work on arm due to the guest mapping vram as io memory and the host mapping vram as normal ram and the mapping attribute mismatch causes caching troubles (only noticeable on real arm hardware, not in emulation). Did something change here recently? Indeed, Armv8.4 introduced the S2FWB feature to cope with situations like this - essentially it allows the hypervisor to share RAM-backed pages with the guest without losing coherency regardless of how the guest maps them. Robin.
Re: [PATCH v1 1/2] drm/qxl: replace ioremap by ioremap_cache on arm64
On 2022-03-23 07:15, Christian König wrote: On 22.03.22 10:34, Cong Liu wrote: qxl uses ioremap to map ram_header and rom; in the arm64 implementation, the device is mapped as DEVICE_nGnRE, which cannot support unaligned access. Well, that some ARM boards don't allow unaligned access to MMIO space is a well-known bug of those ARM boards. So as far as I know this is a hardware bug you are trying to work around here and I'm not 100% sure that this is correct. No, this one's not a bug. The Device memory type used for iomem mappings is *architecturally* defined to enforce properties like aligned accesses, no speculation, no reordering, etc. If something wants to be treated more like RAM than actual MMIO registers, then ioremap_wc() or ioremap_cache() is the appropriate thing to do in general (with the former being a bit more portable according to Documentation/driver-api/device-io.rst). Of course *then* you might find that on some systems the interconnect/PCIe implementation/endpoint doesn't actually like unaligned accesses once the CPU MMU starts allowing them to be sent out, but hey, one step at a time ;) Robin. Christian.
6.620515] pc : setup_hw_slot+0x24/0x60 [qxl] [ 6.620961] lr : setup_slot+0x34/0xf0 [qxl] [ 6.621376] sp : 800012b73760 [ 6.621701] x29: 800012b73760 x28: 0001 x27: 1000 [ 6.622400] x26: 0001 x25: 0400 x24: cf376848c000 [ 6.623099] x23: c4087400 x22: cf3718e17140 x21: [ 6.623823] x20: c4086000 x19: c40870b0 x18: 0014 [ 6.624519] x17: 4d3605ab x16: bb3b6129 x15: 6e771809 [ 6.625214] x14: 0001 x13: 007473696c5f7974 x12: 696e69615f65 [ 6.625909] x11: d543656a x10: x9 : cf3718e085a4 [ 6.626616] x8 : 006c7871 x7 : 000a x6 : 0017 [ 6.627343] x5 : 1400 x4 : 800011f63400 x3 : 1400 [ 6.628047] x2 : x1 : c40870b0 x0 : c4086000 [ 6.628751] Call trace: [ 6.628994] setup_hw_slot+0x24/0x60 [qxl] [ 6.629404] setup_slot+0x34/0xf0 [qxl] [ 6.629790] qxl_device_init+0x6f0/0x7f0 [qxl] [ 6.630235] qxl_pci_probe+0xdc/0x1d0 [qxl] [ 6.630654] local_pci_probe+0x48/0xb8 [ 6.631027] pci_device_probe+0x194/0x208 [ 6.631464] really_probe+0xd0/0x458 [ 6.631818] __driver_probe_device+0x124/0x1c0 [ 6.632256] driver_probe_device+0x48/0x130 [ 6.632669] __driver_attach+0xc4/0x1a8 [ 6.633049] bus_for_each_dev+0x78/0xd0 [ 6.633437] driver_attach+0x2c/0x38 [ 6.633789] bus_add_driver+0x154/0x248 [ 6.634168] driver_register+0x6c/0x128 [ 6.635205] __pci_register_driver+0x4c/0x58 [ 6.635628] qxl_init+0x48/0x1000 [qxl] [ 6.636013] do_one_initcall+0x50/0x240 [ 6.636390] do_init_module+0x60/0x238 [ 6.636768] load_module+0x2458/0x2900 [ 6.637136] __do_sys_finit_module+0xbc/0x128 [ 6.637561] __arm64_sys_finit_module+0x28/0x38 [ 6.638004] invoke_syscall+0x74/0xf0 [ 6.638366] el0_svc_common.constprop.0+0x58/0x1a8 [ 6.638836] do_el0_svc+0x2c/0x90 [ 6.639216] el0_svc+0x40/0x190 [ 6.639526] el0t_64_sync_handler+0xb0/0xb8 [ 6.639934] el0t_64_sync+0x1a4/0x1a8 [ 6.640294] Code: 910003fd f9484804 f9400c23 8b050084 (f809c083) [ 6.640889] ---[ end trace 95615d89b7c87f95 ]--- Signed-off-by: Cong Liu --- drivers/gpu/drm/qxl/qxl_kms.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/gpu/drm/qxl/qxl_kms.c 
b/drivers/gpu/drm/qxl/qxl_kms.c index 4dc5ad13f12c..0e61ac04d8ad 100644 --- a/drivers/gpu/drm/qxl/qxl_kms.c +++ b/drivers/gpu/drm/qxl/qxl_kms.c @@ -165,7 +165,11 @@ int qxl_device_init(struct qxl_device *qdev, (int)qdev->surfaceram_size / 1024, (sb == 4) ? "64bit" : "32bit"); +#ifdef CONFIG_ARM64 + qdev->rom = ioremap_cache(qdev->rom_base, qdev->rom_size); +#else qdev->rom = ioremap(qdev->rom_base, qdev->rom_size); +#endif if (!qdev->rom) { pr_err("Unable to ioremap ROM\n"); r = -ENOMEM; @@ -183,9 +187,15 @@ int qxl_device_init(struct qxl_device *qdev, goto rom_unmap; } +#ifdef CONFIG_ARM64 + qdev->ram_header = ioremap_cache(qdev->vram_base + + qdev->rom->ram_header_offset, + sizeof(*qdev->ram_header)); +#else qdev->ram_header = ioremap(qdev->vram_base + qdev->rom->ram_header_offset, sizeof(*qdev->ram_header)); +#endif if (!qdev->ram_header) { DRM_ERROR("Unable to ioremap RAM header\n"); r = -ENOMEM; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
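As a hedged alternative to the CONFIG_ARM64 #ifdef in the patch above: since the discussion establishes that these BARs are RAM-backed on a virtual device, the more portable ioremap_wc() (per Documentation/driver-api/device-io.rst) could be used on all architectures instead. This is only a sketch, under the unverified assumption that qxl's ROM and RAM header accesses tolerate write-combining:

```c
	/* Sketch: a RAM-backed BAR on a virtual device can be mapped
	 * RAM-like on every architecture, avoiding the arm64 special
	 * case (assumption: write-combining is acceptable here). */
	qdev->rom = ioremap_wc(qdev->rom_base, qdev->rom_size);
	if (!qdev->rom) {
		pr_err("Unable to ioremap ROM\n");
		return -ENOMEM;
	}
```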
Re: [PATCH v2 4/8] drm/virtio: Improve DMA API usage for shmem BOs
On 2022-03-14 22:42, Dmitry Osipenko wrote: DRM API requires the DRM's driver to be backed with the device that can be used for generic DMA operations. The VirtIO-GPU device can't perform DMA operations if it uses PCI transport because PCI device driver creates a virtual VirtIO-GPU device that isn't associated with the PCI. Use PCI's GPU device for the DRM's device instead of the VirtIO-GPU device and drop DMA-related hacks from the VirtIO-GPU driver. Signed-off-by: Dmitry Osipenko --- drivers/gpu/drm/virtio/virtgpu_drv.c| 22 +++--- drivers/gpu/drm/virtio/virtgpu_drv.h| 5 +-- drivers/gpu/drm/virtio/virtgpu_kms.c| 7 ++-- drivers/gpu/drm/virtio/virtgpu_object.c | 56 + drivers/gpu/drm/virtio/virtgpu_vq.c | 13 +++--- 5 files changed, 37 insertions(+), 66 deletions(-) diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.c b/drivers/gpu/drm/virtio/virtgpu_drv.c index 5f25a8d15464..8449dad3e65c 100644 --- a/drivers/gpu/drm/virtio/virtgpu_drv.c +++ b/drivers/gpu/drm/virtio/virtgpu_drv.c @@ -46,9 +46,9 @@ static int virtio_gpu_modeset = -1; MODULE_PARM_DESC(modeset, "Disable/Enable modesetting"); module_param_named(modeset, virtio_gpu_modeset, int, 0400); -static int virtio_gpu_pci_quirk(struct drm_device *dev, struct virtio_device *vdev) +static int virtio_gpu_pci_quirk(struct drm_device *dev) { - struct pci_dev *pdev = to_pci_dev(vdev->dev.parent); + struct pci_dev *pdev = to_pci_dev(dev->dev); const char *pname = dev_name(>dev); bool vga = (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; char unique[20]; @@ -101,6 +101,7 @@ static int virtio_gpu_pci_quirk(struct drm_device *dev, struct virtio_device *vd static int virtio_gpu_probe(struct virtio_device *vdev) { struct drm_device *dev; + struct device *dma_dev; int ret; if (drm_firmware_drivers_only() && virtio_gpu_modeset == -1) @@ -109,18 +110,29 @@ static int virtio_gpu_probe(struct virtio_device *vdev) if (virtio_gpu_modeset == 0) return -EINVAL; - dev = drm_dev_alloc(, >dev); + /* +* If GPU's parent is a PCI device, then we 
will use this PCI device +* for the DRM's driver device because GPU won't have PCI's IOMMU DMA +* ops in this case since GPU device is sitting on a separate (from PCI) +* virtio-bus. +*/ + if (!strcmp(vdev->dev.parent->bus->name, "pci")) Nit: dev_is_pci() ? However, what about other VirtIO transports? Wouldn't virtio-mmio with F_ACCESS_PLATFORM be in a similar situation? Robin. + dma_dev = vdev->dev.parent; + else + dma_dev = >dev; + + dev = drm_dev_alloc(, dma_dev); if (IS_ERR(dev)) return PTR_ERR(dev); vdev->priv = dev; if (!strcmp(vdev->dev.parent->bus->name, "pci")) { - ret = virtio_gpu_pci_quirk(dev, vdev); + ret = virtio_gpu_pci_quirk(dev); if (ret) goto err_free; } - ret = virtio_gpu_init(dev); + ret = virtio_gpu_init(vdev, dev); if (ret) goto err_free; diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.h b/drivers/gpu/drm/virtio/virtgpu_drv.h index 0a194aaad419..b2d93cb12ebf 100644 --- a/drivers/gpu/drm/virtio/virtgpu_drv.h +++ b/drivers/gpu/drm/virtio/virtgpu_drv.h @@ -100,8 +100,6 @@ struct virtio_gpu_object { struct virtio_gpu_object_shmem { struct virtio_gpu_object base; - struct sg_table *pages; - uint32_t mapped; }; struct virtio_gpu_object_vram { @@ -214,7 +212,6 @@ struct virtio_gpu_drv_cap_cache { }; struct virtio_gpu_device { - struct device *dev; struct drm_device *ddev; struct virtio_device *vdev; @@ -282,7 +279,7 @@ extern struct drm_ioctl_desc virtio_gpu_ioctls[DRM_VIRTIO_NUM_IOCTLS]; void virtio_gpu_create_context(struct drm_device *dev, struct drm_file *file); /* virtgpu_kms.c */ -int virtio_gpu_init(struct drm_device *dev); +int virtio_gpu_init(struct virtio_device *vdev, struct drm_device *dev); void virtio_gpu_deinit(struct drm_device *dev); void virtio_gpu_release(struct drm_device *dev); int virtio_gpu_driver_open(struct drm_device *dev, struct drm_file *file); diff --git a/drivers/gpu/drm/virtio/virtgpu_kms.c b/drivers/gpu/drm/virtio/virtgpu_kms.c index 3313b92db531..0d1e3eb61bee 100644 --- a/drivers/gpu/drm/virtio/virtgpu_kms.c +++ 
b/drivers/gpu/drm/virtio/virtgpu_kms.c @@ -110,7 +110,7 @@ static void virtio_gpu_get_capsets(struct virtio_gpu_device *vgdev, vgdev->num_capsets = num_capsets; } -int virtio_gpu_init(struct drm_device *dev) +int virtio_gpu_init(struct virtio_device *vdev, struct drm_device *dev) { static vq_callback_t *callbacks[] = { virtio_gpu_ctrl_ack, virtio_gpu_cursor_ack @@ -123,7 +123,7 @@ int virtio_gpu_init(struct drm_device *dev) u32 num_scanouts,
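Robin's nit about the bus-name string compare could be addressed roughly as below (a kernel sketch, not compiled here); note it still leaves open the question raised above about virtio-mmio with F_ACCESS_PLATFORM:

```c
	struct device *dma_dev;

	/* dev_is_pci() is clearer than strcmp() on bus->name */
	if (dev_is_pci(vdev->dev.parent))
		dma_dev = vdev->dev.parent;	/* PCI parent carries the DMA/IOMMU ops */
	else
		dma_dev = &vdev->dev;		/* e.g. virtio-mmio transport */
```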
Re: [PATCH v2] iommu/iova: Separate out rcache init
On 2022-02-03 09:59, John Garry wrote: Currently the rcache structures are allocated for all IOVA domains, even if they do not use "fast" alloc+free interface. This is wasteful of memory. In addition, fails in init_iova_rcaches() are not handled safely, which is less than ideal. Make "fast" users call a separate rcache init explicitly, which includes error checking. Reviewed-by: Robin Murphy Signed-off-by: John Garry --- Differences to v1: - Drop stubs for iova_domain_init_rcaches() and iova_domain_free_rcaches() - Use put_iova_domain() in vdpa code diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d85d54f2b549..b22034975301 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -525,6 +525,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; + int ret; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) return -EINVAL; @@ -559,6 +560,9 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, } init_iova_domain(iovad, 1UL << order, base_pfn); + ret = iova_domain_init_rcaches(iovad); + if (ret) + return ret; /* If the FQ fails we can simply fall back to strict mode */ if (domain->type == IOMMU_DOMAIN_DMA_FQ && iommu_dma_init_fq(domain)) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index b28c9435b898..7e9c3a97c040 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -15,13 +15,14 @@ /* The anchor node sits above the top of the usable address space */ #define IOVA_ANCHOR ~0UL +#define IOVA_RANGE_CACHE_MAX_SIZE 6 /* log of max cached IOVA range size (in pages) */ + static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, unsigned long size); static unsigned long iova_rcache_get(struct iova_domain *iovad, unsigned long size, unsigned long limit_pfn); -static void init_iova_rcaches(struct iova_domain *iovad); static 
void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad); static void free_iova_rcaches(struct iova_domain *iovad); @@ -64,8 +65,6 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule, iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; rb_link_node(>anchor.node, NULL, >rbroot.rb_node); rb_insert_color(>anchor.node, >rbroot); - cpuhp_state_add_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, >cpuhp_dead); - init_iova_rcaches(iovad); } EXPORT_SYMBOL_GPL(init_iova_domain); @@ -488,6 +487,13 @@ free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size) } EXPORT_SYMBOL_GPL(free_iova_fast); +static void iova_domain_free_rcaches(struct iova_domain *iovad) +{ + cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, + >cpuhp_dead); + free_iova_rcaches(iovad); +} + /** * put_iova_domain - destroys the iova domain * @iovad: - iova domain in question. @@ -497,9 +503,9 @@ void put_iova_domain(struct iova_domain *iovad) { struct iova *iova, *tmp; - cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, - >cpuhp_dead); - free_iova_rcaches(iovad); + if (iovad->rcaches) + iova_domain_free_rcaches(iovad); + rbtree_postorder_for_each_entry_safe(iova, tmp, >rbroot, node) free_iova_mem(iova); } @@ -608,6 +614,7 @@ EXPORT_SYMBOL_GPL(reserve_iova); */ #define IOVA_MAG_SIZE 128 +#define MAX_GLOBAL_MAGS 32 /* magazines per bin */ struct iova_magazine { unsigned long size; @@ -620,6 +627,13 @@ struct iova_cpu_rcache { struct iova_magazine *prev; }; +struct iova_rcache { + spinlock_t lock; + unsigned long depot_size; + struct iova_magazine *depot[MAX_GLOBAL_MAGS]; + struct iova_cpu_rcache __percpu *cpu_rcaches; +}; + static struct iova_magazine *iova_magazine_alloc(gfp_t flags) { return kzalloc(sizeof(struct iova_magazine), flags); @@ -693,28 +707,54 @@ static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn) mag->pfns[mag->size++] = pfn; } -static void init_iova_rcaches(struct iova_domain *iovad) +int 
iova_domain_init_rcaches(struct iova_domain *iovad) { - struct iova_cpu_rcache *cpu_rcache; - struct iova_rcache *rcache; unsigned int cpu; - int i; + int i, ret; + + iovad->rcaches = kcalloc(IOVA_RANGE_CACHE_MAX_SIZE, +sizeof(struct iova_rcache), +
Re: [PATCH] iommu/iova: Separate out rcache init
On 2022-01-28 11:32, John Garry wrote: On 26/01/2022 17:00, Robin Murphy wrote: As above, I vote for just forward-declaring the free routine in iova.c and keeping it entirely private. BTW, speaking of forward declarations, it's possible to remove all the forward declarations in iova.c now that the FQ code is gone - but with a good bit of rearranging. However I am not sure how much people care about that or whether the code layout is sane... Indeed, I was very tempted to raise the question there of whether there was any more cleanup or refactoring that could be done to justify collecting all the rcache code together at the top of iova.c. But in the end I didn't, so my opinion still remains a secret... Robin.
Re: [PATCH] iommu/iova: Separate out rcache init
On 2022-01-26 13:55, John Garry wrote: Currently the rcache structures are allocated for all IOVA domains, even if they do not use "fast" alloc+free interface. This is wasteful of memory. In addition, fails in init_iova_rcaches() are not handled safely, which is less than ideal. Make "fast" users call a separate rcache init explicitly, which includes error checking. Signed-off-by: John Garry Mangled patch? (no "---" separator here) Overall this looks great, just a few comments further down... diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 3a46f2cc9e5d..dd066d990809 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -525,6 +525,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; + int ret; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) return -EINVAL; @@ -559,6 +560,9 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, } init_iova_domain(iovad, 1UL << order, base_pfn); + ret = iova_domain_init_rcaches(iovad); + if (ret) + return ret; /* If the FQ fails we can simply fall back to strict mode */ if (domain->type == IOMMU_DOMAIN_DMA_FQ && iommu_dma_init_fq(domain)) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index b28c9435b898..d3adc6ea5710 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -15,13 +15,14 @@ /* The anchor node sits above the top of the usable address space */ #define IOVA_ANCHOR ~0UL +#define IOVA_RANGE_CACHE_MAX_SIZE 6 /* log of max cached IOVA range size (in pages) */ + static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, unsigned long size); static unsigned long iova_rcache_get(struct iova_domain *iovad, unsigned long size, unsigned long limit_pfn); -static void init_iova_rcaches(struct iova_domain *iovad); static void free_cpu_cached_iovas(unsigned int cpu, struct 
iova_domain *iovad); static void free_iova_rcaches(struct iova_domain *iovad); @@ -64,8 +65,6 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule, iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; rb_link_node(>anchor.node, NULL, >rbroot.rb_node); rb_insert_color(>anchor.node, >rbroot); - cpuhp_state_add_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, >cpuhp_dead); - init_iova_rcaches(iovad); } EXPORT_SYMBOL_GPL(init_iova_domain); @@ -497,9 +496,9 @@ void put_iova_domain(struct iova_domain *iovad) { struct iova *iova, *tmp; - cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, - >cpuhp_dead); - free_iova_rcaches(iovad); + if (iovad->rcaches) + iova_domain_free_rcaches(iovad); + rbtree_postorder_for_each_entry_safe(iova, tmp, >rbroot, node) free_iova_mem(iova); } @@ -608,6 +607,7 @@ EXPORT_SYMBOL_GPL(reserve_iova); */ #define IOVA_MAG_SIZE 128 +#define MAX_GLOBAL_MAGS 32 /* magazines per bin */ struct iova_magazine { unsigned long size; @@ -620,6 +620,13 @@ struct iova_cpu_rcache { struct iova_magazine *prev; }; +struct iova_rcache { + spinlock_t lock; + unsigned long depot_size; + struct iova_magazine *depot[MAX_GLOBAL_MAGS]; + struct iova_cpu_rcache __percpu *cpu_rcaches; +}; + static struct iova_magazine *iova_magazine_alloc(gfp_t flags) { return kzalloc(sizeof(struct iova_magazine), flags); @@ -693,28 +700,62 @@ static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn) mag->pfns[mag->size++] = pfn; } -static void init_iova_rcaches(struct iova_domain *iovad) +int iova_domain_init_rcaches(struct iova_domain *iovad) { - struct iova_cpu_rcache *cpu_rcache; - struct iova_rcache *rcache; unsigned int cpu; - int i; + int i, ret; + + iovad->rcaches = kcalloc(IOVA_RANGE_CACHE_MAX_SIZE, +sizeof(struct iova_rcache), +GFP_KERNEL); + if (!iovad->rcaches) + return -ENOMEM; for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + struct iova_cpu_rcache *cpu_rcache; + struct iova_rcache *rcache; + rcache = >rcaches[i]; 
spin_lock_init(>lock); rcache->depot_size = 0; - rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), cache_line_size()); - if (WARN_ON(!rcache->cpu_rcaches)) - continue; + rcache->cpu_rcaches =
Re: [PATCH 4/5] iommu: Separate IOVA rcache memories from iova_domain structure
Hi John, On 2021-12-20 08:49, John Garry wrote: On 24/09/2021 11:01, John Garry wrote: Only dma-iommu.c and vdpa actually use the "fast" mode of IOVA alloc and free. As such, it's wasteful that all other IOVA domains hold the rcache memories. In addition, the current IOVA domain init implementation is poor (init_iova_domain()), in that errors are ignored and not passed to the caller. The only errors can come from the IOVA rcache init, and fixing up all the IOVA domain init callsites to handle the errors would take some work. Separate the IOVA rcache out of the IOVA domain, and create a new IOVA domain structure, iova_caching_domain. Signed-off-by: John Garry Hi Robin, Do you have any thoughts on this patch? The decision is whether we stick with a single iova domain structure or support this super structure for iova domains which support the rcache. I did not try the former - it would be do-able but I am not sure on how it would look. TBH I feel inclined to take the simpler approach of just splitting the rcache array to a separate allocation, making init_iova_rcaches() public (with a proper return value), and tweaking put_iova_domain() to make rcache cleanup conditional. A residual overhead of 3 extra pointers in iova_domain doesn't seem like *too* much for non-DMA-API users to bear. Unless you want to try generalising the rcache mechanism completely away from IOVA API specifics, it doesn't seem like there's really enough to justify the bother of having its own distinct abstraction layer. Cheers, Robin.
Re: [PATCH v2] iova: Move fast alloc size roundup into alloc_iova_fast()
On 2021-12-07 11:17, John Garry wrote: It really is a property of the IOVA rcache code that we need to alloc a power-of-2 size, so relocate the functionality to resize into alloc_iova_fast(), rather than the callsites. I'd still much prefer to resolve the issue that there shouldn't *be* more than one caller in the first place, but hey. Acked-by: Robin Murphy Signed-off-by: John Garry Acked-by: Will Deacon Reviewed-by: Xie Yongji Acked-by: Jason Wang Acked-by: Michael S. Tsirkin --- Differences to v1: - Separate out from original series which conflicts with Robin's IOVA FQ work: https://lore.kernel.org/linux-iommu/1632477717-5254-1-git-send-email-john.ga...@huawei.com/ - Add tags - thanks! diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index b42e38a0dbe2..84dee53fe892 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -442,14 +442,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain, shift = iova_shift(iovad); iova_len = size >> shift; - /* -* Freeing non-power-of-two-sized allocations back into the IOVA caches -* will come back to bite us badly, so we have to waste a bit of space -* rounding up anything cacheable to make sure that can't happen. The -* order of the unadjusted size will still match upon freeing. -*/ - if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) - iova_len = roundup_pow_of_two(iova_len); dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit); diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 9e8bc802ac05..ff567cbc42f7 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -497,6 +497,15 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size, unsigned long iova_pfn; struct iova *new_iova; + /* +* Freeing non-power-of-two-sized allocations back into the IOVA caches +* will come back to bite us badly, so we have to waste a bit of space +* rounding up anything cacheable to make sure that can't happen. 
The +* order of the unadjusted size will still match upon freeing. +*/ + if (size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) + size = roundup_pow_of_two(size); + iova_pfn = iova_rcache_get(iovad, size, limit_pfn + 1); if (iova_pfn) return iova_pfn; diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c index 1daae2608860..2b1143f11d8f 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.c +++ b/drivers/vdpa/vdpa_user/iova_domain.c @@ -292,14 +292,6 @@ vduse_domain_alloc_iova(struct iova_domain *iovad, unsigned long iova_len = iova_align(iovad, size) >> shift; unsigned long iova_pfn; - /* -* Freeing non-power-of-two-sized allocations back into the IOVA caches -* will come back to bite us badly, so we have to waste a bit of space -* rounding up anything cacheable to make sure that can't happen. The -* order of the unadjusted size will still match upon freeing. -*/ - if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) - iova_len = roundup_pow_of_two(iova_len); iova_pfn = alloc_iova_fast(iovad, iova_len, limit >> shift, true); return iova_pfn << shift; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/5] iommu: Some IOVA code reorganisation
On 2021-11-16 14:21, John Garry wrote: On 04/10/2021 12:44, Will Deacon wrote: On Fri, Sep 24, 2021 at 06:01:52PM +0800, John Garry wrote: The IOVA domain structure is a bit overloaded, holding: - IOVA tree management - FQ control - IOVA rcache memories Indeed only a couple of IOVA users use the rcache, and only dma-iommu.c uses the FQ feature. This series separates out that structure. In addition, it moves the FQ code into dma-iommu.c . This is not strictly necessary, but it does make it easier for the FQ domain lookup the rcache domain. The rcache code stays where it is, as it may be reworked in future, so there is not much point in relocating and then discarding. This topic was initially discussed and suggested (I think) by Robin here: https://lore.kernel.org/linux-iommu/1d06eda1-9961-d023-f5e7-fe87e768f...@arm.com/ It would be useful to have Robin's Ack on patches 2-4. The implementation looks straightforward to me, but the thread above isn't very clear about what is being suggested. Hi Robin, Just wondering if you had made any progress on your FQ code rework or your own re-org? Hey John - as it happens I started hacking on that in earnest about half an hour ago, aiming to get something out later this week. Cheers, Robin. I wasn't planning on progressing https://lore.kernel.org/linux-iommu/1626259003-201303-1-git-send-email-john.ga...@huawei.com/ until this is done first (and that is still a big issue), even though not strictly necessary. Thanks, John
Re: [PATCH 0/5] iommu: Some IOVA code reorganisation
On 2021-10-04 12:44, Will Deacon wrote: On Fri, Sep 24, 2021 at 06:01:52PM +0800, John Garry wrote: The IOVA domain structure is a bit overloaded, holding: - IOVA tree management - FQ control - IOVA rcache memories Indeed only a couple of IOVA users use the rcache, and only dma-iommu.c uses the FQ feature. This series separates out that structure. In addition, it moves the FQ code into dma-iommu.c . This is not strictly necessary, but it does make it easier for the FQ domain lookup the rcache domain. The rcache code stays where it is, as it may be reworked in future, so there is not much point in relocating and then discarding. This topic was initially discussed and suggested (I think) by Robin here: https://lore.kernel.org/linux-iommu/1d06eda1-9961-d023-f5e7-fe87e768f...@arm.com/ It would be useful to have Robin's Ack on patches 2-4. The implementation looks straightforward to me, but the thread above isn't very clear about what is being suggested. FWIW I actually got about half-way through writing my own equivalent of patches 2-3, except tackling it from the other direction - simplifying the FQ code *before* moving whatever was left to iommu-dma, then I got side-tracked trying to make io-pgtable use that freelist properly, and then I've been on holiday the last 2 weeks. I've got other things to catch up on first but I'll try to get to this later this week. To play devil's advocate: there aren't many direct users of the iovad code: either they'll die out entirely (and everybody will use the dma-iommu code) and it's fine having the flush queue code where it is, or we'll get more users and the likelihood of somebody else wanting flush queues increases. I think the FQ code is mostly just here as a historical artefact, since the IOVA allocator was the only thing common to the Intel and AMD DMA ops when the common FQ implementation was factored out of those, so although it's essentially orthogonal it was still related enough that it was an easy place to stick it. 
Cheers, Robin.
Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()
On 2021-08-04 06:02, Yongji Xie wrote: On Tue, Aug 3, 2021 at 6:54 PM Robin Murphy wrote: On 2021-08-03 09:54, Yongji Xie wrote: On Tue, Aug 3, 2021 at 3:41 PM Jason Wang wrote: 在 2021/7/29 下午3:34, Xie Yongji 写道: Export alloc_iova_fast() and free_iova_fast() so that some modules can use it to improve iova allocation efficiency. It's better to explain why alloc_iova() is not sufficient here. Fine. What I fail to understand from the later patches is what the IOVA domain actually represents. If the "device" is a userspace process then logically the "IOVA" would be the userspace address, so presumably somewhere you're having to translate between this arbitrary address space and actual usable addresses - if you're worried about efficiency surely it would be even better to not do that? Yes, userspace daemon needs to translate the "IOVA" in a DMA descriptor to the VA (from mmap(2)). But this actually doesn't affect performance since it's an identical mapping in most cases. I'm not familiar with the vhost_iotlb stuff, but it looks suspiciously like you're walking yet another tree to make those translations. Even if the buffer can be mapped all at once with a fixed offset such that each DMA mapping call doesn't need a lookup for each individual "IOVA" - that might be what's happening already, but it's a bit hard to follow just reading the patches in my mail client - vhost_iotlb_add_range() doesn't look like it's super-cheap to call, and you're serialising on a lock for that. My main point, though, is that if you've already got something else keeping track of the actual addresses, then the way you're using an iova_domain appears to be something you could do with a trivial bitmap allocator. That's why I don't buy the efficiency argument. 
The main design points of the IOVA allocator are to manage large address spaces while trying to maximise spatial locality to minimise the underlying pagetable usage, and allocating with a flexible limit to support multiple devices with different addressing capabilities in the same address space. If none of those aspects are relevant to the use-case - which AFAICS appears to be true here - then as a general-purpose resource allocator it's rubbish and has an unreasonably massive memory overhead and there are many, many better choices. FWIW I've recently started thinking about moving all the caching stuff out of iova_domain and into the iommu-dma layer since it's now a giant waste of space for all the other current IOVA users. Presumably userspace doesn't have any concern about alignment and the things we have to worry about for the DMA API in general, so it's pretty much just allocating slots in a buffer, and there are far more effective ways to do that than a full-blown address space manager. Considering iova allocation efficiency, I think the iova allocator is better here. In most cases, we don't even need to hold a spin lock during iova allocation. If you're going to reuse any infrastructure I'd have expected it to be SWIOTLB rather than the IOVA allocator. Because, y'know, you're *literally implementing a software I/O TLB* ;) But actually what we can reuse in SWIOTLB is the IOVA allocator. Huh? Those are completely unrelated and orthogonal things - SWIOTLB does not use an external allocator (see find_slots()). By SWIOTLB I mean specifically the library itself, not dma-direct or any of the other users built around it. The functionality for managing slots in a buffer and bouncing data in and out can absolutely be reused - that's why users like the Xen and iommu-dma code *are* reusing it instead of open-coding their own versions. And the IOVA management in SWIOTLB is not what we want. 
For example, SWIOTLB allocates and uses contiguous memory for bouncing, which is not necessary in VDUSE case. alloc_iova() allocates a contiguous (in IOVA address) region of space. In vduse_domain_map_page() you use it to allocate a contiguous region of space from your bounce buffer. Can you clarify how that is fundamentally different from allocating a contiguous region of space from a bounce buffer? Nobody's saying the underlying implementation details of where the buffer itself comes from can't be tweaked. And VDUSE needs coherent mapping which is not supported by the SWIOTLB. Besides, the SWIOTLB works in singleton mode (designed for platform IOMMU) , but VDUSE is based on on-chip IOMMU (supports multiple instances). That's not entirely true - the IOMMU bounce buffering scheme introduced in intel-iommu and now moved into the iommu-dma layer was already a step towards something conceptually similar. It does still rely on stealing the underlying pages from the global SWIOTLB pool at the moment, but the bouncing is effectively done in a per-IOMMU-domain context. The next step is currently queued in linux-next, wherein we can now have individual per-device SWIOTLB pools. In fact a
Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()
On 2021-08-03 09:54, Yongji Xie wrote: On Tue, Aug 3, 2021 at 3:41 PM Jason Wang wrote: On 2021/7/29 3:34 PM, Xie Yongji wrote: Export alloc_iova_fast() and free_iova_fast() so that some modules can use it to improve iova allocation efficiency. It's better to explain why alloc_iova() is not sufficient here. Fine. What I fail to understand from the later patches is what the IOVA domain actually represents. If the "device" is a userspace process then logically the "IOVA" would be the userspace address, so presumably somewhere you're having to translate between this arbitrary address space and actual usable addresses - if you're worried about efficiency surely it would be even better to not do that? Presumably userspace doesn't have any concern about alignment and the things we have to worry about for the DMA API in general, so it's pretty much just allocating slots in a buffer, and there are far more effective ways to do that than a full-blown address space manager. If you're going to reuse any infrastructure I'd have expected it to be SWIOTLB rather than the IOVA allocator. Because, y'know, you're *literally implementing a software I/O TLB* ;) Robin.
Re: [RFC v1 3/8] intel/vt-d: make DMAR table parsing code more flexible
On 2021-07-09 12:43, Wei Liu wrote: Microsoft Hypervisor provides a set of hypercalls to manage device domains. The root kernel should parse the DMAR so that it can program the IOMMU (with hypercalls) correctly. The DMAR code was designed to work with Intel IOMMU only. Add two more parameters to make it useful to Microsoft Hypervisor. Microsoft Hypervisor does not need the DMAR parsing code to allocate an Intel IOMMU structure; it also wishes to always reparse the DMAR table even after it has been parsed before. We've recently defined the VIOT table for describing paravirtualised IOMMUs - would it make more sense to extend that to support the Microsoft implementation than to abuse a hardware-specific table? Am I right in assuming said hypervisor isn't intended to only ever run on Intel hardware? Robin. Adjust Intel IOMMU code to use the new dmar_table_init. There should be no functional change to Intel IOMMU code. Signed-off-by: Wei Liu --- We may be able to combine alloc and force_parse? 
--- drivers/iommu/intel/dmar.c | 38 - drivers/iommu/intel/iommu.c | 2 +- drivers/iommu/intel/irq_remapping.c | 2 +- include/linux/dmar.h| 2 +- 4 files changed, 30 insertions(+), 14 deletions(-) diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c index 84057cb9596c..bd72f47c728b 100644 --- a/drivers/iommu/intel/dmar.c +++ b/drivers/iommu/intel/dmar.c @@ -408,7 +408,8 @@ dmar_find_dmaru(struct acpi_dmar_hardware_unit *drhd) * structure which uniquely represent one DMA remapping hardware unit * present in the platform */ -static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg) +static int dmar_parse_one_drhd_internal(struct acpi_dmar_header *header, + void *arg, bool alloc) { struct acpi_dmar_hardware_unit *drhd; struct dmar_drhd_unit *dmaru; @@ -440,12 +441,14 @@ static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg) return -ENOMEM; } - ret = alloc_iommu(dmaru); - if (ret) { - dmar_free_dev_scope(&dmaru->devices, - &dmaru->devices_cnt); - kfree(dmaru); - return ret; + if (alloc) { + ret = alloc_iommu(dmaru); + if (ret) { + dmar_free_dev_scope(&dmaru->devices, + &dmaru->devices_cnt); + kfree(dmaru); + return ret; + } } dmar_register_drhd_unit(dmaru); @@ -456,6 +459,16 @@ static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg) return 0; } +static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg) +{ + return dmar_parse_one_drhd_internal(header, arg, true); +} + +int dmar_parse_one_drhd_noalloc(struct acpi_dmar_header *header, void *arg) +{ + return dmar_parse_one_drhd_internal(header, arg, false); +} + static void dmar_free_drhd(struct dmar_drhd_unit *dmaru) { if (dmaru->devices && dmaru->devices_cnt) @@ -633,7 +646,7 @@ static inline int dmar_walk_dmar_table(struct acpi_table_dmar *dmar, * parse_dmar_table - parses the DMA reporting table */ static int __init -parse_dmar_table(void) +parse_dmar_table(bool alloc) { struct acpi_table_dmar *dmar; int drhd_count = 0; @@ -650,6 +663,9 @@ parse_dmar_table(void) 
.cb[ACPI_DMAR_TYPE_SATC] = &dmar_parse_one_satc, }; + if (!alloc) + cb.cb[ACPI_DMAR_TYPE_HARDWARE_UNIT] = &dmar_parse_one_drhd_noalloc; + /* * Do it again, earlier dmar_tbl mapping could be mapped with * fixed map. @@ -840,13 +856,13 @@ void __init dmar_register_bus_notifier(void) } -int __init dmar_table_init(void) +int __init dmar_table_init(bool alloc, bool force_parse) { static int dmar_table_initialized; int ret; - if (dmar_table_initialized == 0) { - ret = parse_dmar_table(); + if (dmar_table_initialized == 0 || force_parse) { + ret = parse_dmar_table(alloc); if (ret < 0) { if (ret != -ENODEV) pr_info("Parse DMAR table failure.\n"); diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index be35284a2016..a4294d310b93 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -4310,7 +4310,7 @@ int __init intel_iommu_init(void) } down_write(&dmar_global_lock); - if (dmar_table_init()) { + if (dmar_table_init(true, false)) { if (force_on) panic("tboot: Failed to initialize DMAR table\n"); goto out_free_dmar; diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c index f912fe45bea2..0e8abef862e4 100644 --- a/drivers/iommu/intel/irq_remapping.c
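The shape of the change above is a callback table indexed by DMAR entry type, with the hardware-unit slot swapped for a no-alloc variant when the caller asks to skip IOMMU allocation. A miniature stand-in for that dispatch pattern (all names and return values here are illustrative, not the kernel's):

```c
/* Sketch of a type-indexed callback table where one slot is overridden
 * depending on a caller flag, mirroring the parse_dmar_table() change. */
typedef int (*parse_fn)(int arg);

static int parse_alloc(int arg)   { return arg + 100; } /* stand-in: parses and allocates */
static int parse_noalloc(int arg) { return arg; }       /* stand-in: parses only */

static int parse_table(int type, int arg, int alloc)
{
    /* Default table: every type gets the allocating parser. */
    parse_fn cb[2] = { parse_alloc, parse_alloc };

    if (!alloc)
        cb[0] = parse_noalloc;  /* type 0 plays the ACPI_DMAR_TYPE_HARDWARE_UNIT role */

    return cb[type](arg);
}
```

The benefit over threading a `bool` through every callback is that only the dispatch site changes; the individual parsers keep their signatures.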
Re: [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU
On 2021-07-09 12:43, Wei Liu wrote: Some devices may have been claimed by the hypervisor already. One such example is that a user can assign a NIC for debugging purposes. Ideally Linux should be able to retrieve that information, but there is no way to do that yet. And designing that new mechanism is going to take time. Provide a command line option for skipping devices. This is a stopgap solution, so it is intentionally undocumented. Hopefully we can retire it in the future. Huh? If the host is using a device, why the heck is it exposing any knowledge of that device to the guest at all, let alone allowing the guest to do anything that could affect its operation!? Robin. Signed-off-by: Wei Liu --- drivers/iommu/hyperv-iommu.c | 45 1 file changed, 45 insertions(+) diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c index 043dcff06511..353da5036387 100644 --- a/drivers/iommu/hyperv-iommu.c +++ b/drivers/iommu/hyperv-iommu.c @@ -349,6 +349,16 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = { #ifdef CONFIG_HYPERV_ROOT_PVIOMMU +/* The IOMMU will not claim these PCI devices. */ +static char *pci_devs_to_skip; +static int __init mshv_iommu_setup_skip(char *str) { + pci_devs_to_skip = str; + + return 0; +} +/* mshv_iommu_skip=(:BB:DD.F)(:BB:DD.F) */ +__setup("mshv_iommu_skip=", mshv_iommu_setup_skip); + /* DMA remapping support */ struct hv_iommu_domain { struct iommu_domain domain; @@ -774,6 +784,41 @@ static struct iommu_device *hv_iommu_probe_device(struct device *dev) if (!dev_is_pci(dev)) return ERR_PTR(-ENODEV); + /* +* Skip the PCI device specified in `pci_devs_to_skip`. This is a +* temporary solution until we figure out a way to extract information +* from the hypervisor about what devices it is already using. 
+*/ + if (pci_devs_to_skip && *pci_devs_to_skip) { + int pos = 0; + int parsed; + int segment, bus, slot, func; + struct pci_dev *pdev = to_pci_dev(dev); + + do { + parsed = 0; + + sscanf(pci_devs_to_skip + pos, + " (%x:%x:%x.%x) %n", + &segment, &bus, &slot, &func, &parsed); + + if (parsed <= 0) + break; + + if (pci_domain_nr(pdev->bus) == segment && + pdev->bus->number == bus && + PCI_SLOT(pdev->devfn) == slot && + PCI_FUNC(pdev->devfn) == func) + { + dev_info(dev, "skipped by MSHV IOMMU\n"); + return ERR_PTR(-ENODEV); + } + + pos += parsed; + + } while (pci_devs_to_skip[pos]); + } + vdev = kzalloc(sizeof(*vdev), GFP_KERNEL); if (!vdev) return ERR_PTR(-ENOMEM);
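The parsing loop above leans on sscanf()'s `%n` conversion to advance through the option string one `(seg:bus:dev.fn)` tuple at a time: `%n` stores how many bytes were consumed, and a stored value of zero means no further tuple matched. A standalone userspace model of the same loop (`match_bdf()` is a hypothetical helper, not part of the patch):

```c
#include <stdio.h>

/* Return 1 if the (seg:bus:dev.fn) list contains the given function,
 * scanning exactly the way the patch's hv_iommu_probe_device() does. */
static int match_bdf(const char *list, unsigned seg, unsigned bus,
                     unsigned slot, unsigned func)
{
    int pos = 0;

    for (;;) {
        int parsed = 0;
        unsigned s, b, d, f;

        /* %n only fires if all four %x conversions succeeded first. */
        sscanf(list + pos, " (%x:%x:%x.%x) %n", &s, &b, &d, &f, &parsed);
        if (parsed <= 0)
            return 0;            /* no further tuple matched */

        if (s == seg && b == bus && d == slot && f == func)
            return 1;

        pos += parsed;           /* advance past the consumed tuple */
        if (!list[pos])
            return 0;
    }
}
```

Note the hex conversions: a function written as `0a` in the list matches slot/function value 10.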
Re: [PATCH v5 2/5] ACPI: Move IOMMU setup code out of IORT
On 2021-06-18 16:20, Jean-Philippe Brucker wrote: Extract the code that sets up the IOMMU infrastructure from IORT, since it can be reused by VIOT. Move it one level up into a new acpi_iommu_configure_id() function, which calls the IORT parsing function which in turn calls the acpi_iommu_fwspec_init() helper. Reviewed-by: Robin Murphy Signed-off-by: Jean-Philippe Brucker --- include/acpi/acpi_bus.h | 3 ++ include/linux/acpi_iort.h | 8 ++--- drivers/acpi/arm64/iort.c | 74 +-- drivers/acpi/scan.c | 73 +- 4 files changed, 86 insertions(+), 72 deletions(-) diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h index 3a82faac5767..41f092a269f6 100644 --- a/include/acpi/acpi_bus.h +++ b/include/acpi/acpi_bus.h @@ -588,6 +588,9 @@ struct acpi_pci_root { bool acpi_dma_supported(struct acpi_device *adev); enum dev_dma_attr acpi_get_dma_attr(struct acpi_device *adev); +int acpi_iommu_fwspec_init(struct device *dev, u32 id, + struct fwnode_handle *fwnode, + const struct iommu_ops *ops); int acpi_dma_get_range(struct device *dev, u64 *dma_addr, u64 *offset, u64 *size); int acpi_dma_configure_id(struct device *dev, enum dev_dma_attr attr, diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h index f7f054833afd..f1f0842a2cb2 100644 --- a/include/linux/acpi_iort.h +++ b/include/linux/acpi_iort.h @@ -35,8 +35,7 @@ void acpi_configure_pmsi_domain(struct device *dev); int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id); /* IOMMU interface */ int iort_dma_get_ranges(struct device *dev, u64 *size); -const struct iommu_ops *iort_iommu_configure_id(struct device *dev, - const u32 *id_in); +int iort_iommu_configure_id(struct device *dev, const u32 *id_in); int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head); phys_addr_t acpi_iort_dma_get_max_cpu_address(void); #else @@ -50,9 +49,8 @@ static inline void acpi_configure_pmsi_domain(struct device *dev) { } /* IOMMU interface */ static inline int iort_dma_get_ranges(struct device *dev, 
u64 *size) { return -ENODEV; } -static inline const struct iommu_ops *iort_iommu_configure_id( - struct device *dev, const u32 *id_in) -{ return NULL; } +static inline int iort_iommu_configure_id(struct device *dev, const u32 *id_in) +{ return -ENODEV; } static inline int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head) { return 0; } diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c index a940be1cf2af..487d1095030d 100644 --- a/drivers/acpi/arm64/iort.c +++ b/drivers/acpi/arm64/iort.c @@ -806,23 +806,6 @@ static struct acpi_iort_node *iort_get_msi_resv_iommu(struct device *dev) return NULL; } -static inline const struct iommu_ops *iort_fwspec_iommu_ops(struct device *dev) -{ - struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev); - - return (fwspec && fwspec->ops) ? fwspec->ops : NULL; -} - -static inline int iort_add_device_replay(struct device *dev) -{ - int err = 0; - - if (dev->bus && !device_iommu_mapped(dev)) - err = iommu_probe_device(dev); - - return err; -} - /** * iort_iommu_msi_get_resv_regions - Reserved region driver helper * @dev: Device from iommu_get_resv_regions() @@ -900,18 +883,6 @@ static inline bool iort_iommu_driver_enabled(u8 type) } } -static int arm_smmu_iort_xlate(struct device *dev, u32 streamid, - struct fwnode_handle *fwnode, - const struct iommu_ops *ops) -{ - int ret = iommu_fwspec_init(dev, fwnode, ops); - - if (!ret) - ret = iommu_fwspec_add_ids(dev, &streamid, 1); - - return ret; -} - static bool iort_pci_rc_supports_ats(struct acpi_iort_node *node) { struct acpi_iort_root_complex *pci_rc; @@ -946,7 +917,7 @@ static int iort_iommu_xlate(struct device *dev, struct acpi_iort_node *node, return iort_iommu_driver_enabled(node->type) ? 
-EPROBE_DEFER : -ENODEV; - return arm_smmu_iort_xlate(dev, streamid, iort_fwnode, ops); + return acpi_iommu_fwspec_init(dev, streamid, iort_fwnode, ops); } struct iort_pci_alias_info { @@ -1020,24 +991,13 @@ static int iort_nc_iommu_map_id(struct device *dev, * @dev: device to configure * @id_in: optional input id const value pointer * - * Returns: iommu_ops pointer on configuration success - * NULL on configuration failure + * Returns: 0 on success, <0 on failure */ -const struct iommu_ops *iort_iommu_configure_id(struct device *dev, - const u32 *id_in) +int iort_iommu_configure_id(struct device *dev, const u32 *id_in) {
Re: [PATCH v5 1/5] ACPI: arm64: Move DMA setup operations out of IORT
On 2021-06-18 16:20, Jean-Philippe Brucker wrote: Extract generic DMA setup code out of IORT, so it can be reused by VIOT. Keep it in drivers/acpi/arm64 for now, since it could break x86 platforms that haven't run this code so far, if they have invalid tables. Reviewed-by: Robin Murphy Reviewed-by: Eric Auger Signed-off-by: Jean-Philippe Brucker --- drivers/acpi/arm64/Makefile | 1 + include/linux/acpi.h| 3 +++ include/linux/acpi_iort.h | 6 ++--- drivers/acpi/arm64/dma.c| 50 ++ drivers/acpi/arm64/iort.c | 54 ++--- drivers/acpi/scan.c | 2 +- 6 files changed, 66 insertions(+), 50 deletions(-) create mode 100644 drivers/acpi/arm64/dma.c diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile index 6ff50f4ed947..66acbe77f46e 100644 --- a/drivers/acpi/arm64/Makefile +++ b/drivers/acpi/arm64/Makefile @@ -1,3 +1,4 @@ # SPDX-License-Identifier: GPL-2.0-only obj-$(CONFIG_ACPI_IORT) += iort.o obj-$(CONFIG_ACPI_GTDT) += gtdt.o +obj-y += dma.o diff --git a/include/linux/acpi.h b/include/linux/acpi.h index c60745f657e9..7aaa9559cc19 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -259,9 +259,12 @@ void acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa); #ifdef CONFIG_ARM64 void acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa); +void acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size); #else static inline void acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa) { } +static inline void +acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size) { } #endif int acpi_numa_memory_affinity_init (struct acpi_srat_mem_affinity *ma); diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h index 1a12baa58e40..f7f054833afd 100644 --- a/include/linux/acpi_iort.h +++ b/include/linux/acpi_iort.h @@ -34,7 +34,7 @@ struct irq_domain *iort_get_device_domain(struct device *dev, u32 id, void acpi_configure_pmsi_domain(struct device *dev); int iort_pmsi_get_dev_id(struct device 
*dev, u32 *dev_id); /* IOMMU interface */ -void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *size); +int iort_dma_get_ranges(struct device *dev, u64 *size); const struct iommu_ops *iort_iommu_configure_id(struct device *dev, const u32 *id_in); int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head); @@ -48,8 +48,8 @@ static inline struct irq_domain *iort_get_device_domain( { return NULL; } static inline void acpi_configure_pmsi_domain(struct device *dev) { } /* IOMMU interface */ -static inline void iort_dma_setup(struct device *dev, u64 *dma_addr, - u64 *size) { } +static inline int iort_dma_get_ranges(struct device *dev, u64 *size) +{ return -ENODEV; } static inline const struct iommu_ops *iort_iommu_configure_id( struct device *dev, const u32 *id_in) { return NULL; } diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c new file mode 100644 index ..f16739ad3cc0 --- /dev/null +++ b/drivers/acpi/arm64/dma.c @@ -0,0 +1,50 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include +#include + +void acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size) +{ + int ret; + u64 end, mask; + u64 dmaaddr = 0, size = 0, offset = 0; + + /* +* If @dev is expected to be DMA-capable then the bus code that created +* it should have initialised its dma_mask pointer by this point. For +* now, we'll continue the legacy behaviour of coercing it to the +* coherent mask if not, but we'll no longer do so quietly. +*/ + if (!dev->dma_mask) { + dev_warn(dev, "DMA mask not set\n"); + dev->dma_mask = &dev->coherent_dma_mask; + } + + if (dev->coherent_dma_mask) + size = max(dev->coherent_dma_mask, dev->coherent_dma_mask + 1); + else + size = 1ULL << 32; + + ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size); + if (ret == -ENODEV) + ret = iort_dma_get_ranges(dev, &size); + if (!ret) { + /* +* Limit coherent and dma mask based on size retrieved from +* firmware. 
+*/ + end = dmaaddr + size - 1; + mask = DMA_BIT_MASK(ilog2(end) + 1); + dev->bus_dma_limit = end; + dev->coherent_dma_mask = min(dev->coherent_dma_mask, mask); + *dev->dma_mask = min(*dev->dma_mask, mask); + } + + *dma_addr = dmaaddr; + *dma_size = size; + + ret = dma_direct_set_offset(dev, dmaaddr + offset, dmaaddr, size); + + dev_dbg(dev,
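The mask derivation above takes the last usable DMA address and builds the narrowest bit mask that still covers it. It can be modelled in userspace; `dma_bit_mask()` and `ilog2_u64()` below are illustrative reimplementations of the kernel's `DMA_BIT_MASK()` and `ilog2()`:

```c
#include <stdint.h>

/* Floor of log2(v); caller must not pass 0. */
static int ilog2_u64(uint64_t v)
{
    int l = -1;
    while (v) { l++; v >>= 1; }
    return l;
}

/* Mask with the low n bits set, like the kernel's DMA_BIT_MASK(n). */
static uint64_t dma_bit_mask(int n)
{
    return (n >= 64) ? ~0ULL : (1ULL << n) - 1;
}

/* The patch's computation: end = dmaaddr + size - 1;
 * mask = DMA_BIT_MASK(ilog2(end) + 1). */
static uint64_t mask_for_range(uint64_t dmaaddr, uint64_t size)
{
    uint64_t end = dmaaddr + size - 1;
    return dma_bit_mask(ilog2_u64(end) + 1);
}
```

For instance, a 4 GiB window starting at 0 yields a 32-bit mask, which the caller then uses to clamp both the coherent and streaming masks with `min()`.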
Re: [PATCH v5 4/5] iommu/dma: Pass address limit rather than size to iommu_setup_dma_ops()
On 2021-06-18 16:20, Jean-Philippe Brucker wrote: Passing a 64-bit address width to iommu_setup_dma_ops() is valid on virtual platforms, but isn't currently possible. The overflow check in iommu_dma_init_domain() prevents this even when @dma_base isn't 0. Pass a limit address instead of a size, so callers don't have to fake a size to work around the check. The base and limit parameters are being phased out, because: * they are redundant for x86 callers. dma-iommu already reserves the first page, and the upper limit is already in domain->geometry. * they can now be obtained from dev->dma_range_map on Arm. But removing them on Arm isn't completely straightforward so is left for future work. As an intermediate step, simplify the x86 callers by passing dummy limits. Reviewed-by: Robin Murphy Signed-off-by: Jean-Philippe Brucker --- include/linux/dma-iommu.h | 4 ++-- arch/arm64/mm/dma-mapping.c | 2 +- drivers/iommu/amd/iommu.c | 2 +- drivers/iommu/dma-iommu.c | 12 ++-- drivers/iommu/intel/iommu.c | 5 + 5 files changed, 11 insertions(+), 14 deletions(-) diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h index 6e75a2d689b4..758ca4694257 100644 --- a/include/linux/dma-iommu.h +++ b/include/linux/dma-iommu.h @@ -19,7 +19,7 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base); void iommu_put_dma_cookie(struct iommu_domain *domain); /* Setup call for arch DMA mapping code */ -void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size); +void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit); /* The DMA API isn't _quite_ the whole story, though... 
*/ /* @@ -50,7 +50,7 @@ struct msi_msg; struct device; static inline void iommu_setup_dma_ops(struct device *dev, u64 dma_base, - u64 size) + u64 dma_limit) { } diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c index 4bf1dd3eb041..6719f9efea09 100644 --- a/arch/arm64/mm/dma-mapping.c +++ b/arch/arm64/mm/dma-mapping.c @@ -50,7 +50,7 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, dev->dma_coherent = coherent; if (iommu) - iommu_setup_dma_ops(dev, dma_base, size); + iommu_setup_dma_ops(dev, dma_base, dma_base + size - 1); #ifdef CONFIG_XEN if (xen_swiotlb_detect()) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index 3ac42bbdefc6..216323fb27ef 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -1713,7 +1713,7 @@ static void amd_iommu_probe_finalize(struct device *dev) /* Domains are initialized for this device - have a look what we ended up with */ domain = iommu_get_domain_for_dev(dev); if (domain->type == IOMMU_DOMAIN_DMA) - iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, 0); + iommu_setup_dma_ops(dev, 0, U64_MAX); else set_dma_ops(dev, NULL); } diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 7bcdd1205535..c62e19bed302 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -319,16 +319,16 @@ static bool dev_is_untrusted(struct device *dev) * iommu_dma_init_domain - Initialise a DMA mapping domain * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie() * @base: IOVA at which the mappable address space starts - * @size: Size of IOVA space + * @limit: Last address of the IOVA space * @dev: Device the domain is being initialised for * - * @base and @size should be exact multiples of IOMMU page granularity to + * @base and @limit + 1 should be exact multiples of IOMMU page granularity to * avoid rounding surprises. If necessary, we reserve the page at address 0 * to ensure it is an invalid IOVA. 
It is safe to reinitialise a domain, but * any change which could make prior IOVAs invalid will fail. */ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, - u64 size, struct device *dev) +dma_addr_t limit, struct device *dev) { struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; @@ -346,7 +346,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, /* Check the domain allows at least some access to the device... */ if (domain->geometry.force_aperture) { if (base > domain->geometry.aperture_end || - base + size <= domain->geometry.aperture_start) { + limit < domain->geometry.aperture_start) { pr_warn("specified DMA range outside IOMMU capability\n"); return -EFAULT; } @@ -1308,7 +1308,7 @@ static const struct dma_map_ops iommu_dma_ops = { *
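The reason for preferring a limit over a size is visible in the rewritten aperture check: with `(base, size)`, the expression `base + size` can wrap for a full 64-bit window, whereas comparing an inclusive limit cannot overflow. A sketch of the overflow-free form (the aperture values in the tests are illustrative):

```c
#include <stdint.h>

/* Overflow-free overlap test between [base, limit] and the domain
 * aperture [ap_start, ap_end], all bounds inclusive -- the form the
 * patch switches iommu_dma_init_domain() to. */
static int range_usable(uint64_t base, uint64_t limit,
                        uint64_t ap_start, uint64_t ap_end)
{
    return !(base > ap_end || limit < ap_start);
}
```

With the old `(base, size)` form, a caller wanting the whole 64-bit space had to fake a size, and `base + size <= aperture_start` could wrap to 0 and spuriously fail; passing `limit = U64_MAX` needs no such workaround.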
Re: [PATCH v4 5/6] iommu/dma: Simplify calls to iommu_setup_dma_ops()
On 2021-06-18 11:50, Jean-Philippe Brucker wrote: On Wed, Jun 16, 2021 at 06:02:39PM +0100, Robin Murphy wrote: diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index c62e19bed302..175f8eaeb5b3 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -1322,7 +1322,9 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit) if (domain->type == IOMMU_DOMAIN_DMA) { if (iommu_dma_init_domain(domain, dma_base, dma_limit, dev)) goto out_err; - dev->dma_ops = _dma_ops; + set_dma_ops(dev, _dma_ops); + } else { + set_dma_ops(dev, NULL); I'm not keen on moving this here, since iommu-dma only knows that its own ops are right for devices it *is* managing; it can't assume any particular ops are appropriate for devices it isn't. The idea here is that arch_setup_dma_ops() may have already set the appropriate ops for the non-IOMMU case, so if the default domain type is passthrough then we leave those in place. For example, I do still plan to revisit my conversion of arch/arm someday, at which point I'd have to undo this for that reason. Makes sense, I'll remove this bit. Simplifying the base and size arguments is of course fine, but TBH I'd say rip the whole bloody lot out of the arch_setup_dma_ops() flow now. It's a considerable faff passing them around for nothing but a tenuous sanity check in iommu_dma_init_domain(), and now that dev->dma_range_map is a common thing we should expect that to give us any relevant limitations if we even still care. So I started working on this but it gets too bulky for a preparatory patch. Dropping the parameters from arch_setup_dma_ops() seems especially complicated because arm32 does need the size parameter for IOMMU mappings and that value falls back to the bus DMA mask or U32_MAX in the absence of dma-ranges. I could try to dig into this for a separate series. 
Even only dropping the parameters from iommu_setup_dma_ops() isn't completely trivial (8 files changed, 55 insertions(+), 36 deletions(-) because we still need the lower IOVA limit from dma_range_map), so I'd rather send it separately and have it sit in -next for a while. Oh, sure, I didn't mean to imply that the whole cleanup should be within the scope of this series, just that we can shave off as much as we *do* need to touch here (which TBH is pretty much what you're doing already), and mainly to start taking the attitude that these arguments are now superseded and increasingly vestigial. I expected the cross-arch cleanup to be a bit fiddly, but I'd forgotten that arch/arm was still actively using these values, so maybe I can revisit this when I pick up my iommu-dma conversion again (I swear it's not dead, just resting!) Cheers, Robin.
Re: [PATCH v4 2/6] ACPI: Move IOMMU setup code out of IORT
On 2021-06-18 08:41, Jean-Philippe Brucker wrote: Hi Eric, On Wed, Jun 16, 2021 at 11:35:13AM +0200, Eric Auger wrote: -const struct iommu_ops *iort_iommu_configure_id(struct device *dev, - const u32 *id_in) +int iort_iommu_configure_id(struct device *dev, const u32 *id_in) { struct acpi_iort_node *node; - const struct iommu_ops *ops; + const struct iommu_ops *ops = NULL; Oops, I need to remove this (and add -Werror to my tests.) +static const struct iommu_ops *acpi_iommu_configure_id(struct device *dev, + const u32 *id_in) +{ + int err; + const struct iommu_ops *ops; + + /* +* If we already translated the fwspec there is nothing left to do, +* return the iommu_ops. +*/ + ops = acpi_iommu_fwspec_ops(dev); + if (ops) + return ops; + + err = iort_iommu_configure_id(dev, id_in); + + /* +* If we have reason to believe the IOMMU driver missed the initial +* add_device callback for dev, replay it to get things in order. +*/ + if (!err && dev->bus && !device_iommu_mapped(dev)) + err = iommu_probe_device(dev); Previously we had: if (!err) { ops = iort_fwspec_iommu_ops(dev); err = iort_add_device_replay(dev); } Please can you explain the transform? I see the acpi_iommu_fwspec_ops call below but is it not straightforward to me. I figured that iort_add_device_replay() is only used once and is sufficiently simple to be inlined manually (saving 10 lines). Then I replaced the ops assignment with returns, which saves another line and may be slightly clearer? I guess it's mostly a matter of taste, the behavior should be exactly the same. Right, IIRC the multiple assignments to ops were more of a haphazard evolution inherited from the DT version, and looking at it now I think the multiple-return is indeed a bit nicer. Similarly, it looks like the factoring out of iort_add_device_replay() was originally an attempt to encapsulate the IOMMU_API dependency, but things have moved around a lot since then, so that seems like a sensible simplification to make too. Robin. 
Also the comment mentions replay. Unsure if it is still OK. The "replay" part is, but "add_device" isn't accurate because it has since been replaced by probe_device. I'll refresh the comment. Thanks, Jean ___ iommu mailing list io...@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 5/6] iommu/dma: Simplify calls to iommu_setup_dma_ops()
On 2021-06-10 08:51, Jean-Philippe Brucker wrote: dma-iommu uses the address bounds described in domain->geometry during IOVA allocation. The address size parameters of iommu_setup_dma_ops() are useful for describing additional limits set by the platform firmware, but aren't needed for drivers that call this function from probe_finalize(). The base parameter can be zero because dma-iommu already removes the first IOVA page, and the limit parameter can be U64_MAX because it's only checked against the domain geometry. Simplify calls to iommu_setup_dma_ops(). Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/amd/iommu.c | 9 + drivers/iommu/dma-iommu.c | 4 +++- drivers/iommu/intel/iommu.c | 10 +- 3 files changed, 5 insertions(+), 18 deletions(-) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index 94b96d81fcfd..d3123bc05c08 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -1708,14 +1708,7 @@ static struct iommu_device *amd_iommu_probe_device(struct device *dev) static void amd_iommu_probe_finalize(struct device *dev) { - struct iommu_domain *domain; - - /* Domains are initialized for this device - have a look what we ended up with */ - domain = iommu_get_domain_for_dev(dev); - if (domain->type == IOMMU_DOMAIN_DMA) - iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, U64_MAX); - else - set_dma_ops(dev, NULL); + iommu_setup_dma_ops(dev, 0, U64_MAX); } static void amd_iommu_release_device(struct device *dev) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index c62e19bed302..175f8eaeb5b3 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -1322,7 +1322,9 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit) if (domain->type == IOMMU_DOMAIN_DMA) { if (iommu_dma_init_domain(domain, dma_base, dma_limit, dev)) goto out_err; - dev->dma_ops = _dma_ops; + set_dma_ops(dev, _dma_ops); + } else { + set_dma_ops(dev, NULL); I'm not keen on moving this here, since 
iommu-dma only knows that its own ops are right for devices it *is* managing; it can't assume any particular ops are appropriate for devices it isn't. The idea here is that arch_setup_dma_ops() may have already set the appropriate ops for the non-IOMMU case, so if the default domain type is passthrough then we leave those in place. For example, I do still plan to revisit my conversion of arch/arm someday, at which point I'd have to undo this for that reason. Simplifying the base and size arguments is of course fine, but TBH I'd say rip the whole bloody lot out of the arch_setup_dma_ops() flow now. It's a considerable faff passing them around for nothing but a tenuous sanity check in iommu_dma_init_domain(), and now that dev->dma_range_map is a common thing we should expect that to give us any relevant limitations if we even still care. That said, those are all things which can be fixed up later if the series is otherwise ready to go and there's still a chance of landing it for 5.14. If you do have any other reason to respin, then I think the x86 probe_finalize functions simply want an unconditional set_dma_ops(dev, NULL) before the iommu_setup_dma_ops() call. Cheers, Robin. 
} return; diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 85f18342603c..8d866940692a 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -5165,15 +5165,7 @@ static void intel_iommu_release_device(struct device *dev) static void intel_iommu_probe_finalize(struct device *dev) { - dma_addr_t base = IOVA_START_PFN << VTD_PAGE_SHIFT; - struct iommu_domain *domain = iommu_get_domain_for_dev(dev); - struct dmar_domain *dmar_domain = to_dmar_domain(domain); - - if (domain && domain->type == IOMMU_DOMAIN_DMA) - iommu_setup_dma_ops(dev, base, - __DOMAIN_MAX_ADDR(dmar_domain->gaw)); - else - set_dma_ops(dev, NULL); + iommu_setup_dma_ops(dev, 0, U64_MAX); } static void intel_iommu_get_resv_regions(struct device *device,
Re: [PATCH v1 5/8] dma: Use size for swiotlb boundary checks
On 2021-06-03 01:41, Andi Kleen wrote: swiotlb currently only uses the start address of a DMA to check if something is in the swiotlb or not. But with virtio and untrusted hosts the host could give some DMA mapping that crosses the swiotlb boundaries, potentially leaking or corrupting data. Add size checks to all the swiotlb checks and reject any DMAs that cross the swiotlb buffer boundaries. Signed-off-by: Andi Kleen --- drivers/iommu/dma-iommu.c | 13 ++--- drivers/xen/swiotlb-xen.c | 11 ++- include/linux/dma-mapping.h | 4 ++-- include/linux/swiotlb.h | 8 +--- kernel/dma/direct.c | 8 kernel/dma/direct.h | 8 kernel/dma/mapping.c| 4 ++-- net/xdp/xsk_buff_pool.c | 2 +- 8 files changed, 30 insertions(+), 28 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 7bcdd1205535..7ef13198721b 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -504,7 +504,7 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr, __iommu_dma_unmap(dev, dma_addr, size); If you can't trust size below then you've already corrupted the IOMMU pagetables here :/ Robin. 
- if (unlikely(is_swiotlb_buffer(phys))) + if (unlikely(is_swiotlb_buffer(phys, size))) swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs); } @@ -575,7 +575,7 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys, } iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask); - if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys)) + if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys, org_size)) swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs); return iova; } @@ -781,7 +781,7 @@ static void iommu_dma_sync_single_for_cpu(struct device *dev, if (!dev_is_dma_coherent(dev)) arch_sync_dma_for_cpu(phys, size, dir); - if (is_swiotlb_buffer(phys)) + if (is_swiotlb_buffer(phys, size)) swiotlb_sync_single_for_cpu(dev, phys, size, dir); } @@ -794,7 +794,7 @@ static void iommu_dma_sync_single_for_device(struct device *dev, return; phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle); - if (is_swiotlb_buffer(phys)) + if (is_swiotlb_buffer(phys, size)) swiotlb_sync_single_for_device(dev, phys, size, dir); if (!dev_is_dma_coherent(dev)) @@ -815,7 +815,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev, if (!dev_is_dma_coherent(dev)) arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir); - if (is_swiotlb_buffer(sg_phys(sg))) + if (is_swiotlb_buffer(sg_phys(sg), sg->length)) swiotlb_sync_single_for_cpu(dev, sg_phys(sg), sg->length, dir); } @@ -832,10 +832,9 @@ static void iommu_dma_sync_sg_for_device(struct device *dev, return; for_each_sg(sgl, sg, nelems, i) { - if (is_swiotlb_buffer(sg_phys(sg))) + if (is_swiotlb_buffer(sg_phys(sg), sg->length)) swiotlb_sync_single_for_device(dev, sg_phys(sg), sg->length, dir); - if (!dev_is_dma_coherent(dev)) arch_sync_dma_for_device(sg_phys(sg), sg->length, dir); } diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index 24d11861ac7d..333846af8d35 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -89,7 +89,8 @@ static inline int 
range_straddles_page_boundary(phys_addr_t p, size_t size) return 0; } -static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr) +static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr, +size_t size) { unsigned long bfn = XEN_PFN_DOWN(dma_to_phys(dev, dma_addr)); unsigned long xen_pfn = bfn_to_local_pfn(bfn); @@ -100,7 +101,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr) * in our domain. Therefore _only_ check address within our domain. */ if (pfn_valid(PFN_DOWN(paddr))) - return is_swiotlb_buffer(paddr); + return is_swiotlb_buffer(paddr, size); return 0; } @@ -431,7 +432,7 @@ static void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, } /* NOTE: We use dev_addr here, not paddr! */ - if (is_xen_swiotlb_buffer(hwdev, dev_addr)) + if (is_xen_swiotlb_buffer(hwdev, dev_addr, size)) swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs); } @@ -448,7 +449,7 @@ xen_swiotlb_sync_single_for_cpu(struct device *dev, dma_addr_t dma_addr,
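The boundary condition the patch closes can be sketched in plain, userspace C. All names and addresses below are illustrative stand-ins, not the kernel's actual swiotlb internals: a start-only membership test accepts a mapping that begins inside the bounce buffer but runs past its end, while the size-aware check rejects anything that straddles either boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;

/* Illustrative bounce-buffer bounds, standing in for the kernel's
 * real swiotlb pool limits. */
static const phys_addr_t io_tlb_start = 0x1000;
static const phys_addr_t io_tlb_end = 0x9000;

/* Pre-patch behaviour: only the start address is tested, so a mapping
 * that begins inside the bounce buffer but runs past its end is
 * wrongly treated as a swiotlb buffer. */
static bool is_swiotlb_buffer_start_only(phys_addr_t paddr)
{
	return paddr >= io_tlb_start && paddr < io_tlb_end;
}

/* Patched behaviour: the whole [paddr, paddr + size) range must lie
 * within the bounce buffer, so DMA that straddles either boundary is
 * rejected. */
static bool is_swiotlb_buffer_sized(phys_addr_t paddr, size_t size)
{
	return paddr >= io_tlb_start && paddr + size <= io_tlb_end;
}
```

A mapping starting at 0x8800 of length 0x1000 passes the start-only test but fails the sized one, which is exactly the leak/corruption window the commit message describes.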
Re: [PATCH v1 6/8] dma: Add return value to dma_unmap_page
Hi Andi, On 2021-06-03 01:41, Andi Kleen wrote: In some situations when we know swiotlb is forced and we have to deal with untrusted hosts, it's useful to know if a mapping was in the swiotlb or not. This allows us to abort any IO operation that would access memory outside the swiotlb. Otherwise it might be possible for a malicious host to inject any guest page in a read operation. While it couldn't directly access the results of the read() inside the guest, there might be scenarios where data is echoed back with a write(), and that would then leak guest memory. Add a return value to dma_unmap_single/page. Most users of course will ignore it. The return value is set to EIO if we're in forced swiotlb mode and the buffer is not inside the swiotlb buffer. Otherwise it's always 0. I have to say my first impression of this isn't too good :( What it looks like to me is abusing SWIOTLB's internal housekeeping to keep track of virtio-specific state. The DMA API does not attempt to validate calls in general since in many cases the additional overhead would be prohibitive. It has always been callers' responsibility to keep track of what they mapped and make sure sync/unmap calls match, and there are many, many, subtle and not-so-subtle ways for things to go wrong if they don't. If virtio is not doing a good enough job of that, what's the justification for making it the DMA API's problem? A new callback is used to avoid changing all the IOMMU drivers. Nit: presumably by "IOMMU drivers" you actually mean arch DMA API backends? As an aside, we'll take a look at the rest of the series from the perspective of our prototyping for Arm's Confidential Compute Architecture, but I'm not sure we'll need it, since accesses beyond the bounds of the shared SWIOTLB buffer shouldn't be an issue for us.
Furthermore, AFAICS it's still not going to help against exfiltrating guest memory by over-unmapping the original SWIOTLB slot *without* going past the end of the whole buffer, but I think Martin's patch *has* addressed that already. Robin. Signed-off-by: Andi Kleen --- drivers/iommu/dma-iommu.c | 17 +++-- include/linux/dma-map-ops.h | 3 +++ include/linux/dma-mapping.h | 7 --- kernel/dma/mapping.c| 6 +- 4 files changed, 23 insertions(+), 10 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 7ef13198721b..babe46f2ae3a 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -491,7 +491,8 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr, iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist); } -static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr, +static int __iommu_dma_unmap_swiotlb_check(struct device *dev, + dma_addr_t dma_addr, size_t size, enum dma_data_direction dir, unsigned long attrs) { @@ -500,12 +501,15 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr, phys = iommu_iova_to_phys(domain, dma_addr); if (WARN_ON(!phys)) - return; + return -EIO; __iommu_dma_unmap(dev, dma_addr, size); if (unlikely(is_swiotlb_buffer(phys, size))) swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs); + else if (swiotlb_force == SWIOTLB_FORCE) + return -EIO; + return 0; } static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys, @@ -856,12 +860,13 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page, return dma_handle; } -static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle, +static int iommu_dma_unmap_page_check(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction dir, unsigned long attrs) { if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) iommu_dma_sync_single_for_cpu(dev, dma_handle, size, dir); - __iommu_dma_unmap_swiotlb(dev, dma_handle, size, dir, 
attrs); + return __iommu_dma_unmap_swiotlb_check(dev, dma_handle, size, dir, + attrs); } /* @@ -946,7 +951,7 @@ static void iommu_dma_unmap_sg_swiotlb(struct device *dev, struct scatterlist *s int i; for_each_sg(sg, s, nents, i) - __iommu_dma_unmap_swiotlb(dev, sg_dma_address(s), + __iommu_dma_unmap_swiotlb_check(dev, sg_dma_address(s), sg_dma_len(s), dir, attrs); } @@ -1291,7 +1296,7 @@ static const struct dma_map_ops iommu_dma_ops = { .mmap = iommu_dma_mmap, .get_sgtable= iommu_dma_get_sgtable, .map_page = iommu_dma_map_page, - .unmap_page = iommu_dma_unmap_page, + .unmap_page_check = iommu_dma_unmap_page_check, .map_sg
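The return-value contract being debated can be modelled outside the kernel. The helper names below are hypothetical stand-ins (only `SWIOTLB_FORCE` and `-EIO` mirror the kernel's real symbols): when bouncing is forced because the host is untrusted, every streaming mapping must have gone through the swiotlb, so unmapping a buffer that was never bounced is reported as an error the caller can use to abort the I/O.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative model of the proposed unmap-check contract. */
enum swiotlb_mode { SWIOTLB_NORMAL, SWIOTLB_FORCE };

static int unmap_page_check(bool was_in_swiotlb, enum swiotlb_mode mode)
{
	if (was_in_swiotlb)
		return 0;	/* bounced as expected */
	if (mode == SWIOTLB_FORCE)
		return -EIO;	/* forced mode, yet DMA bypassed the bounce buffer */
	return 0;		/* direct mapping is fine otherwise */
}
```

As the commit message notes, most callers would ignore the return value; only paths that handle untrusted hosts need to check it.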
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-16 15:38, Christoph Hellwig wrote: [...] diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index f1e38526d5bd40..996dfdf9d375dd 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2017,7 +2017,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain, .iommu_dev = smmu->dev, }; - if (smmu_domain->non_strict) + if (!iommu_get_dma_strict()) As Will raised, this also needs to be checking "domain->type == IOMMU_DOMAIN_DMA" to maintain equivalent behaviour to the attribute code below. pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT; pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain); @@ -2449,52 +2449,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev) return group; } -static int arm_smmu_domain_get_attr(struct iommu_domain *domain, - enum iommu_attr attr, void *data) -{ - struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - - switch (domain->type) { - case IOMMU_DOMAIN_DMA: - switch (attr) { - case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE: - *(int *)data = smmu_domain->non_strict; - return 0; - default: - return -ENODEV; - } - break; - default: - return -EINVAL; - } -} [...]
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index f985817c967a25..edb1de479dd1a7 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -668,7 +668,6 @@ struct arm_smmu_domain { struct mutex init_mutex; /* Protects smmu pointer */ struct io_pgtable_ops *pgtbl_ops; - bool non_strict; atomic_t nr_ats_masters; enum arm_smmu_domain_stage stage; diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c index 0aa6d667274970..3dde22b1f8ffb0 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c @@ -761,6 +761,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain, .iommu_dev = smmu->dev, }; + if (!iommu_get_dma_strict()) Ditto here. Sorry for not spotting that sooner :( Robin. + pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT; + if (smmu->impl && smmu->impl->init_context) { ret = smmu->impl->init_context(smmu_domain, &pgtbl_cfg, dev); if (ret) ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
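A minimal sketch of the fix Robin is asking for, using stand-in definitions rather than the kernel's real types: the non-strict page-table quirk must be gated on the domain actually being a DMA API domain, so unmanaged domains (e.g. VFIO's) always stay strict.

```c
#include <stdbool.h>

#define IOMMU_DOMAIN_UNMANAGED	0
#define IOMMU_DOMAIN_DMA	1
#define IO_PGTABLE_QUIRK_NON_STRICT	(1UL << 0)

/* Stand-in for the global behind iommu_get_dma_strict(). */
static bool dma_strict;

/* Sketch of the corrected quirk selection: lazy (non-strict)
 * invalidation may only be enabled for DMA API domains, never for
 * unmanaged domains such as those used by VFIO. */
static unsigned long smmu_pgtbl_quirks(int domain_type)
{
	unsigned long quirks = 0;

	if (domain_type == IOMMU_DOMAIN_DMA && !dma_strict)
		quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
	return quirks;
}
```

This preserves the behaviour of the deleted attribute code, which only ever accepted DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE for IOMMU_DOMAIN_DMA.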
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-31 16:32, Will Deacon wrote: On Wed, Mar 31, 2021 at 02:09:37PM +0100, Robin Murphy wrote: On 2021-03-31 12:49, Will Deacon wrote: On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote: On 2021-03-30 14:58, Will Deacon wrote: On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote: On 2021-03-30 14:11, Will Deacon wrote: On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote: From: Robin Murphy Instead make the global iommu_dma_strict parameter in iommu.c canonical by exporting helpers to get and set it and use those directly in the drivers. This makes sure that the iommu.strict parameter also works for the AMD and Intel IOMMU drivers on x86. As those default to lazy flushing a new IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to represent the default if not overridden by an explicit parameter. Signed-off-by: Robin Murphy . [ported on top of the other iommu_attr changes and added a few small missing bits] Signed-off-by: Christoph Hellwig --- drivers/iommu/amd/iommu.c | 23 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 - drivers/iommu/arm/arm-smmu/arm-smmu.c | 27 + drivers/iommu/dma-iommu.c | 9 +-- drivers/iommu/intel/iommu.c | 64 - drivers/iommu/iommu.c | 27 ++--- include/linux/iommu.h | 4 +- 8 files changed, 40 insertions(+), 165 deletions(-) I really like this cleanup, but I can't help wonder if it's going in the wrong direction. With SoCs often having multiple IOMMU instances and a distinction between "trusted" and "untrusted" devices, then having the flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound unreasonable to me, but this change makes it a global property. The intent here was just to streamline the existing behaviour of stuffing a global property into a domain attribute then pulling it out again in the illusion that it was in any way per-domain.
We're still checking dev_is_untrusted() before making an actual decision, and it's not like we can't add more factors at that point if we want to. Like I say, the cleanup is great. I'm just wondering whether there's a better way to express the complicated logic to decide whether or not to use the flush queue than what we end up with: if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) && domain->ops->flush_iotlb_all && !iommu_get_dma_strict()) which is mixing up globals, device properties and domain properties. The result is that the driver code ends up just using the global to determine whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code, which is a departure from the current way of doing things. But previously, SMMU only ever saw the global policy piped through the domain attribute by iommu_group_alloc_default_domain(), so there's no functional change there. For DMA domains sure, but I don't think that's the case for unmanaged domains such as those used by VFIO. Eh? This is only relevant to DMA domains anyway. Flush queues are part of the IOVA allocator that VFIO doesn't even use. It's always been the case that unmanaged domains only use strict invalidation. Maybe I'm going mad. With this patch, the SMMU driver unconditionally sets IO_PGTABLE_QUIRK_NON_STRICT for page-tables if iommu_get_dma_strict() is true, no? In which case, that will get set for page-tables corresponding to unmanaged domains as well as DMA domains when it is enabled. That didn't happen before because you couldn't set the attribute for unmanaged domains. What am I missing? Oh cock... sorry, all this time I've been saying what I *expect* it to do, while overlooking the fact that the IO_PGTABLE_QUIRK_NON_STRICT hunks were the bits I forgot to write and Christoph had to fix up. Indeed, those should be checking the domain type too to preserve the existing behaviour. Apologies for the confusion. Robin. 
Obviously some of the above checks could be factored out into some kind of iommu_use_flush_queue() helper that IOMMU drivers can also call if they need to keep in sync. Or maybe we just allow iommu-dma to set IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're treating that as a generic thing now. I think a helper that takes a domain would be a good starting point. You mean device, right? The one condition we currently have is at the device level, and there's really nothing inherent to the domain itself that matters (since the type is implicitly IOMMU_DOMAIN_DMA to even care about this). Device would probably work too; you'd pass the first device to attach to the domain when querying this from the SMMU driver, I suppose. Will
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-31 12:49, Will Deacon wrote: On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote: On 2021-03-30 14:58, Will Deacon wrote: On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote: On 2021-03-30 14:11, Will Deacon wrote: On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote: From: Robin Murphy Instead make the global iommu_dma_strict parameter in iommu.c canonical by exporting helpers to get and set it and use those directly in the drivers. This makes sure that the iommu.strict parameter also works for the AMD and Intel IOMMU drivers on x86. As those default to lazy flushing a new IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to represent the default if not overridden by an explicit parameter. Signed-off-by: Robin Murphy . [ported on top of the other iommu_attr changes and added a few small missing bits] Signed-off-by: Christoph Hellwig --- drivers/iommu/amd/iommu.c | 23 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 - drivers/iommu/arm/arm-smmu/arm-smmu.c | 27 + drivers/iommu/dma-iommu.c | 9 +-- drivers/iommu/intel/iommu.c | 64 - drivers/iommu/iommu.c | 27 ++--- include/linux/iommu.h | 4 +- 8 files changed, 40 insertions(+), 165 deletions(-) I really like this cleanup, but I can't help wonder if it's going in the wrong direction. With SoCs often having multiple IOMMU instances and a distinction between "trusted" and "untrusted" devices, then having the flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound unreasonable to me, but this change makes it a global property. The intent here was just to streamline the existing behaviour of stuffing a global property into a domain attribute then pulling it out again in the illusion that it was in any way per-domain. We're still checking dev_is_untrusted() before making an actual decision, and it's not like we can't add more factors at that point if we want to. Like I say, the cleanup is great.
I'm just wondering whether there's a better way to express the complicated logic to decide whether or not to use the flush queue than what we end up with: if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) && domain->ops->flush_iotlb_all && !iommu_get_dma_strict()) which is mixing up globals, device properties and domain properties. The result is that the driver code ends up just using the global to determine whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code, which is a departure from the current way of doing things. But previously, SMMU only ever saw the global policy piped through the domain attribute by iommu_group_alloc_default_domain(), so there's no functional change there. For DMA domains sure, but I don't think that's the case for unmanaged domains such as those used by VFIO. Eh? This is only relevant to DMA domains anyway. Flush queues are part of the IOVA allocator that VFIO doesn't even use. It's always been the case that unmanaged domains only use strict invalidation. Obviously some of the above checks could be factored out into some kind of iommu_use_flush_queue() helper that IOMMU drivers can also call if they need to keep in sync. Or maybe we just allow iommu-dma to set IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're treating that as a generic thing now. I think a helper that takes a domain would be a good starting point. You mean device, right? The one condition we currently have is at the device level, and there's really nothing inherent to the domain itself that matters (since the type is implicitly IOMMU_DOMAIN_DMA to even care about this). Another idea that's just come to mind is now that IOMMU_DOMAIN_DMA has a standard meaning, maybe we could split out a separate IOMMU_DOMAIN_DMA_STRICT type such that it can all propagate from iommu_get_def_domain_type()? 
That feels like it might be quite promising, but I'd still do it as an improvement on top of this patch, since it's beyond just cleaning up the abuse of domain attributes to pass a command-line option around. Robin.
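The four-way condition quoted in this thread could be collected into the `iommu_use_flush_queue()` helper floated above. A userspace sketch with hypothetical stand-in types (none of these are the kernel's real structures) makes the decision tree explicit:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the three things the quoted condition
 * mixes together: a global policy, a device property and a domain
 * capability. */
static bool dma_strict;				/* global policy */

struct device {
	bool untrusted;				/* device property */
};

struct iommu_domain {
	bool has_flush_iotlb_all;		/* driver capability */
	bool fq_initialised;			/* cookie->fq_domain already set */
};

/* Sketch of the proposed helper: queue invalidations only when strict
 * mode is off, the device (if known) is trusted, the driver can flush
 * the whole IOTLB, and no flush queue exists yet. */
static bool iommu_use_flush_queue(const struct device *dev,
				  const struct iommu_domain *domain)
{
	if (domain->fq_initialised || dma_strict)
		return false;
	if (dev && dev->untrusted)
		return false;
	return domain->has_flush_iotlb_all;
}
```

Pulling the checks into one predicate is exactly what lets a driver and iommu-dma stay in sync without each open-coding the mix of globals, device properties and domain properties.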
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-30 14:58, Will Deacon wrote: On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote: On 2021-03-30 14:11, Will Deacon wrote: On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote: From: Robin Murphy Instead make the global iommu_dma_strict parameter in iommu.c canonical by exporting helpers to get and set it and use those directly in the drivers. This makes sure that the iommu.strict parameter also works for the AMD and Intel IOMMU drivers on x86. As those default to lazy flushing a new IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to represent the default if not overridden by an explicit parameter. Signed-off-by: Robin Murphy . [ported on top of the other iommu_attr changes and added a few small missing bits] Signed-off-by: Christoph Hellwig --- drivers/iommu/amd/iommu.c | 23 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 - drivers/iommu/arm/arm-smmu/arm-smmu.c | 27 + drivers/iommu/dma-iommu.c | 9 +-- drivers/iommu/intel/iommu.c | 64 - drivers/iommu/iommu.c | 27 ++--- include/linux/iommu.h | 4 +- 8 files changed, 40 insertions(+), 165 deletions(-) I really like this cleanup, but I can't help wonder if it's going in the wrong direction. With SoCs often having multiple IOMMU instances and a distinction between "trusted" and "untrusted" devices, then having the flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound unreasonable to me, but this change makes it a global property. The intent here was just to streamline the existing behaviour of stuffing a global property into a domain attribute then pulling it out again in the illusion that it was in any way per-domain. We're still checking dev_is_untrusted() before making an actual decision, and it's not like we can't add more factors at that point if we want to. Like I say, the cleanup is great.
I'm just wondering whether there's a better way to express the complicated logic to decide whether or not to use the flush queue than what we end up with: if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) && domain->ops->flush_iotlb_all && !iommu_get_dma_strict()) which is mixing up globals, device properties and domain properties. The result is that the driver code ends up just using the global to determine whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code, which is a departure from the current way of doing things. But previously, SMMU only ever saw the global policy piped through the domain attribute by iommu_group_alloc_default_domain(), so there's no functional change there. Obviously some of the above checks could be factored out into some kind of iommu_use_flush_queue() helper that IOMMU drivers can also call if they need to keep in sync. Or maybe we just allow iommu-dma to set IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're treating that as a generic thing now. For example, see the recent patch from Lu Baolu: https://lore.kernel.org/r/20210225061454.2864009-1-baolu...@linux.intel.com Erm, this patch is based on that one, it's right there in the context :/ Ah, sorry, I didn't spot that! I was just trying to illustrate that this is per-device. Sure, I understand - and I'm just trying to bang home that despite appearances it's never actually been treated as such for SMMU, so anything that's wrong after this change was already wrong before. Robin.
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-30 14:11, Will Deacon wrote: On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote: From: Robin Murphy Instead make the global iommu_dma_strict parameter in iommu.c canonical by exporting helpers to get and set it and use those directly in the drivers. This makes sure that the iommu.strict parameter also works for the AMD and Intel IOMMU drivers on x86. As those default to lazy flushing a new IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to represent the default if not overridden by an explicit parameter. Signed-off-by: Robin Murphy . [ported on top of the other iommu_attr changes and added a few small missing bits] Signed-off-by: Christoph Hellwig --- drivers/iommu/amd/iommu.c | 23 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 - drivers/iommu/arm/arm-smmu/arm-smmu.c | 27 + drivers/iommu/dma-iommu.c | 9 +-- drivers/iommu/intel/iommu.c | 64 - drivers/iommu/iommu.c | 27 ++--- include/linux/iommu.h | 4 +- 8 files changed, 40 insertions(+), 165 deletions(-) I really like this cleanup, but I can't help wonder if it's going in the wrong direction. With SoCs often having multiple IOMMU instances and a distinction between "trusted" and "untrusted" devices, then having the flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound unreasonable to me, but this change makes it a global property. The intent here was just to streamline the existing behaviour of stuffing a global property into a domain attribute then pulling it out again in the illusion that it was in any way per-domain. We're still checking dev_is_untrusted() before making an actual decision, and it's not like we can't add more factors at that point if we want to. For example, see the recent patch from Lu Baolu: https://lore.kernel.org/r/20210225061454.2864009-1-baolu...@linux.intel.com Erm, this patch is based on that one, it's right there in the context :/ Thanks, Robin.
Re: [PATCH 2/3] ACPI: Add driver for the VIOT table
On 2021-03-16 19:16, Jean-Philippe Brucker wrote: The ACPI Virtual I/O Translation Table describes the topology of para-virtual platforms. For now it describes the relation between virtio-iommu and the endpoints it manages. Supporting that requires three steps: (1) acpi_viot_init(): parse the VIOT table, build a list of endpoints and vIOMMUs. (2) acpi_viot_set_iommu_ops(): when the vIOMMU driver is loaded and the device probed, register it to the VIOT driver. This step is required because unlike similar drivers, VIOT doesn't create the vIOMMU device. Note that you're basically the same as the DT case in this regard, so I'd expect things to be closer to that pattern than to that of IORT. [...] @@ -1506,12 +1507,17 @@ int acpi_dma_configure_id(struct device *dev, enum dev_dma_attr attr, { const struct iommu_ops *iommu; u64 dma_addr = 0, size = 0; + int ret; if (attr == DEV_DMA_NOT_SUPPORTED) { set_dma_ops(dev, &dma_dummy_ops); return 0; } + ret = acpi_viot_dma_setup(dev, attr); + if (ret) + return ret > 0 ? 0 : ret; I think things could do with a fair bit of refactoring here. Ideally we want to process a possible _DMA method (acpi_dma_get_range()) regardless of which flavour of IOMMU table might be present, and the amount of duplication we fork into at this point is unfortunate. + iort_dma_setup(dev, &dma_addr, &size); For starters I think most of that should be dragged out to this level here - it's really only the {rc,nc}_dma_get_range() bit that deserves to be the IORT-specific call. iommu = iort_iommu_configure_id(dev, input_id); Similarly, it feels like it's only the table scan part in the middle of that that needs dispatching between IORT/VIOT, and its head and tail pulled out into a common path. [...] +static const struct iommu_ops *viot_iommu_setup(struct device *dev) +{ + struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev); + struct viot_iommu *viommu = NULL; + struct viot_endpoint *ep; + u32 epid; + int ret; + + /* Already translated?
*/ + if (fwspec && fwspec->ops) + return NULL; + + mutex_lock(_lock); + list_for_each_entry(ep, _endpoints, list) { + if (viot_device_match(dev, &ep->dev_id, &epid)) { + epid += ep->endpoint_id; + viommu = ep->viommu; + break; + } + } + mutex_unlock(_lock); + if (!viommu) + return NULL; + + /* We're not translating ourself */ + if (viot_device_match(dev, &viommu->dev_id, &epid)) + return NULL; + + /* +* If we found a PCI range managed by the viommu, we're the one that has +* to request ACS. +*/ + if (dev_is_pci(dev)) + pci_request_acs(); + + if (!viommu->ops || WARN_ON(!viommu->dev)) + return ERR_PTR(-EPROBE_DEFER); Can you create (or look up) a viommu->fwnode when initially parsing the VIOT to represent the IOMMU devices to wait for, such that the viot_device_match() lookup can resolve to that and let you fall into the standard iommu_ops_from_fwnode() path? That's what I mean about following the DT pattern - I guess it might need a bit of trickery to rewrite things if iommu_device_register() eventually turns up with a new fwnode, so I doubt we can get away without *some* kind of private interface between virtio-iommu and VIOT, but it would be nice for the common(ish) DMA paths to stay as unaware of the specifics as possible. + + ret = iommu_fwspec_init(dev, viommu->dev->fwnode, viommu->ops); + if (ret) + return ERR_PTR(ret); + + iommu_fwspec_add_ids(dev, &epid, 1); + + /* +* If we have reason to believe the IOMMU driver missed the initial +* add_device callback for dev, replay it to get things in order. +*/ + if (dev->bus && !device_iommu_mapped(dev)) + iommu_probe_device(dev); + + return viommu->ops; +} + +/** + * acpi_viot_dma_setup - Configure DMA for an endpoint described in VIOT + * @dev: the endpoint + * @attr: coherency property of the endpoint + * + * Setup the DMA and IOMMU ops for an endpoint described by the VIOT table.
+ * + * Return: + * * 0 - @dev doesn't match any VIOT node + * * 1 - ops for @dev were successfully installed + * * -EPROBE_DEFER - ops for @dev aren't yet available + */ +int acpi_viot_dma_setup(struct device *dev, enum dev_dma_attr attr) +{ + const struct iommu_ops *iommu_ops = viot_iommu_setup(dev); + + if (IS_ERR_OR_NULL(iommu_ops)) { + int ret = PTR_ERR(iommu_ops); + + if (ret == -EPROBE_DEFER || ret == 0) + return ret; + dev_err(dev, "error %d while setting up virt IOMMU\n", ret); + return 0; + } + +#ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS
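The return contract documented in the kerneldoc above, and the way the earlier `acpi_dma_configure_id()` hunk consumes it, can be modelled with illustrative stubs (the helper names and booleans below are stand-ins, not kernel API; `EPROBE_DEFER` is defined locally since it is a kernel-internal errno):

```c
#include <stdbool.h>

#define EPROBE_DEFER	517	/* kernel-internal errno, defined here for the sketch */

/* Model of the acpi_viot_dma_setup() return contract:
 *   0             - the device matches no VIOT node
 *   1             - IOMMU ops were installed
 *   -EPROBE_DEFER - the vIOMMU driver hasn't registered its ops yet */
static int viot_dma_setup(bool matches_viot, bool viommu_ready)
{
	if (!matches_viot)
		return 0;
	if (!viommu_ready)
		return -EPROBE_DEFER;
	return 1;
}

/* Caller side, mirroring the quoted acpi_dma_configure_id() hunk:
 * a positive return means "handled, report success", an error is
 * propagated, and zero falls through to the IORT/default path. */
static int dma_configure(bool matches_viot, bool viommu_ready)
{
	int ret = viot_dma_setup(matches_viot, viommu_ready);

	if (ret)
		return ret > 0 ? 0 : ret;
	/* ... would continue with iort_dma_setup() etc. ... */
	return 0;
}
```

The deferral case is what allows endpoints to be probed before the virtio-iommu driver itself has loaded.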
Re: [PATCH 3/3] iommu/virtio: Enable x86 support
On 2021-03-16 19:16, Jean-Philippe Brucker wrote: With the VIOT support in place, x86 platforms can now use the virtio-iommu. The arm64 Kconfig selects IOMMU_DMA, while x86 IOMMU drivers select it themselves. Actually, now that both AMD and Intel are converted over, maybe it's finally time to punt that to x86 arch code to match arm64? Robin. Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/Kconfig | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 2819b5c8ec30..ccca83ef2f06 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -400,8 +400,9 @@ config HYPERV_IOMMU config VIRTIO_IOMMU tristate "Virtio IOMMU driver" depends on VIRTIO - depends on ARM64 + depends on (ARM64 || X86) select IOMMU_API + select IOMMU_DMA if X86 select INTERVAL_TREE select ACPI_VIOT if ACPI help
Re: [PATCH 14/17] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-15 08:33, Christoph Hellwig wrote: On Fri, Mar 12, 2021 at 04:18:24PM +, Robin Murphy wrote: Let me know what you think of the version here: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iommu-cleanup I'll happily switch the patch to you as the author if you're fine with that as well. I still have reservations about removing the attribute API entirely and pretending that io_pgtable_cfg is anything other than a SoC-specific private interface, I think a private interface would make more sense. For now I've just condensed it down to a generic set of quirk bits and dropped the attrs structure, which seems like an ok middle ground for now. That being said I wonder why that quirk isn't simply set in the device tree? Because it's a software policy decision rather than any inherent property of the platform, and the DT certainly doesn't know *when* any particular device might prefer its IOMMU to use cacheable pagetables to minimise TLB miss latency vs. saving the cache capacity for larger data buffers. It really is most logical to decide this at the driver level. In truth the overall concept *is* relatively generic (a trend towards larger system caches and cleverer usage is about both raw performance and saving power on off-SoC DRAM traffic), it's just the particular implementation of using io-pgtable to set an outer-cacheable walk attribute in an SMMU TCR that's pretty much specific to Qualcomm SoCs. Hence why having a common abstraction at the iommu_domain level, but where the exact details are free to vary across different IOMMUs and their respective client drivers, is in many ways an ideal fit. but the reworked patch on its own looks reasonable to me, thanks! (I wasn't too convinced about the iommu_cmd_line wrappers either...) Just iommu_get_dma_strict() needs an export since the SMMU drivers can be modular - I consciously didn't add that myself since I was mistakenly thinking only iommu-dma would call it. Fixed.
Can I get your signoff for the patch? Then I'll switch it over to being attributed to you. Sure - I would have thought that the one I originally posted still stands, but for the avoidance of doubt, for the parts of commit 8b6d45c495bd in your tree that remain from what I wrote: Signed-off-by: Robin Murphy Cheers, Robin.
Re: [PATCH 14/17] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-11 08:26, Christoph Hellwig wrote: On Wed, Mar 10, 2021 at 06:39:57PM +, Robin Murphy wrote: Actually... Just mirroring the iommu_dma_strict value into struct iommu_domain should solve all of that with very little boilerplate code. Yes, my initial thought was to directly replace the attribute with a common flag at iommu_domain level, but since in all cases the behaviour is effectively global rather than actually per-domain, it seemed reasonable to take it a step further. This passes compile-testing for arm64 and x86, what do you think? It seems to miss a few bits, and also generally seems to not actually apply to recent mainline or something like it due to different empty lines in a few places. Yeah, that was sketched out on top of some other development patches, and in being so focused on not breaking any of the x86 behaviours I did indeed overlook fully converting the SMMU drivers... oops! (my thought was to do the conversion for its own sake, then clean up the redundant attribute separately, but I guess it's fine either way) Let me know what you think of the version here: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iommu-cleanup I'll happily switch the patch to you as the author if you're fine with that as well. I still have reservations about removing the attribute API entirely and pretending that io_pgtable_cfg is anything other than a SoC-specific private interface, but the reworked patch on its own looks reasonable to me, thanks! (I wasn't too convinced about the iommu_cmd_line wrappers either...) Just iommu_get_dma_strict() needs an export since the SMMU drivers can be modular - I consciously didn't add that myself since I was mistakenly thinking only iommu-dma would call it. Robin.
Re: [PATCH 14/17] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-10 09:25, Christoph Hellwig wrote: On Wed, Mar 10, 2021 at 10:15:01AM +0100, Christoph Hellwig wrote: On Thu, Mar 04, 2021 at 03:25:27PM +, Robin Murphy wrote: On 2021-03-01 08:42, Christoph Hellwig wrote: Use explicit methods for setting and querying the information instead. Now that everyone's using iommu-dma, is there any point in bouncing this through the drivers at all? Seems like it would make more sense for the x86 drivers to reflect their private options back to iommu_dma_strict (and allow Intel's caching mode to override it as well), then have iommu_dma_init_domain just test !iommu_dma_strict && domain->ops->flush_iotlb_all. Hmm. I looked at this, and killing off ->dma_enable_flush_queue for the ARM drivers and just looking at iommu_dma_strict seems like a very clear win. OTOH x86 is a little more complicated. AMD and intel default to lazy mode, so we'd have to change the global iommu_dma_strict if they are initialized. Also Intel has not only a "static" option to disable lazy mode, but also a "dynamic" one where it iterates structure. So I think on the get side we're stuck with the method, but it still simplifies the whole thing. Actually... Just mirroring the iommu_dma_strict value into struct iommu_domain should solve all of that with very little boilerplate code. Yes, my initial thought was to directly replace the attribute with a common flag at iommu_domain level, but since in all cases the behaviour is effectively global rather than actually per-domain, it seemed reasonable to take it a step further. This passes compile-testing for arm64 and x86, what do you think? Robin. ->8- Subject: [PATCH] iommu: Consolidate strict invalidation handling Now that everyone is using iommu-dma, the global invalidation policy really doesn't need to be woven through several parts of the core API and individual drivers, we can just look it up directly at the one point that we now make the flush queue decision.
If the x86 drivers reflect their internal options and overrides back to iommu_dma_strict, that can become the canonical source. Signed-off-by: Robin Murphy --- drivers/iommu/amd/iommu.c | 2 ++ drivers/iommu/dma-iommu.c | 8 +--- drivers/iommu/intel/iommu.c | 12 drivers/iommu/iommu.c | 35 +++ include/linux/iommu.h | 2 ++ 5 files changed, 44 insertions(+), 15 deletions(-) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index a69a8b573e40..1db29e59d468 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -1856,6 +1856,8 @@ int __init amd_iommu_init_dma_ops(void) else pr_info("Lazy IO/TLB flushing enabled\n"); + iommu_set_dma_strict(amd_iommu_unmap_flush); + return 0; } diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index af765c813cc8..789a950cc125 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -304,10 +304,6 @@ static void iommu_dma_flush_iotlb_all(struct iova_domain *iovad) cookie = container_of(iovad, struct iommu_dma_cookie, iovad); domain = cookie->fq_domain; - /* -* The IOMMU driver supporting DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE -* implies that ops->flush_iotlb_all must be non-NULL. 
-*/ domain->ops->flush_iotlb_all(domain); } @@ -334,7 +330,6 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; - int attr; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) return -EINVAL; @@ -371,8 +366,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, init_iova_domain(iovad, 1UL << order, base_pfn); if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) && - !iommu_domain_get_attr(domain, DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE, ) && - attr) { + domain->ops->flush_iotlb_all && !iommu_get_dma_strict()) { if (init_iova_flush_queue(iovad, iommu_dma_flush_iotlb_all, iommu_dma_entry_dtor)) pr_warn("iova flush queue initialization failed\n"); diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index b5c746f0f63b..f5b452cd1266 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -4377,6 +4377,17 @@ int __init intel_iommu_init(void) down_read(_global_lock); for_each_active_iommu(iommu, drhd) { + if (!intel_iommu_strict && cap_caching_mode(iommu->cap)) { + /* +* The flush queue implementation does not perform page-selective +
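The patch above boils the strict/lazy policy down to one global flag with a setter the x86 drivers call at init time (`iommu_set_dma_strict()`) and a getter consulted at the single flush-queue decision point (`iommu_get_dma_strict()`). A minimal user-space sketch of that shape, with the kernel's locking, `__read_mostly` annotation and module export omitted for brevity (names mirror the patch; behaviour is simplified, not the real implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Strict (synchronous) invalidation by default; drivers that default
 * to lazy mode mirror that choice here during their init. */
static bool iommu_dma_strict = true;

void iommu_set_dma_strict(bool strict)
{
	iommu_dma_strict = strict;
}

bool iommu_get_dma_strict(void)
{
	return iommu_dma_strict;
}

/* The flush-queue decision in iommu_dma_init_domain() then reduces to
 * one test: a flush queue needs a flush_iotlb_all implementation and a
 * non-strict policy. */
bool want_flush_queue(bool have_flush_iotlb_all)
{
	return have_flush_iotlb_all && !iommu_get_dma_strict();
}
```

This is why the AMD hunk is a one-liner (`iommu_set_dma_strict(amd_iommu_unmap_flush)` modulo polarity) and why dma-iommu no longer needs the `DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE` round-trip through each driver.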
Re: [PATCH 16/17] iommu: remove DOMAIN_ATTR_IO_PGTABLE_CFG
On 2021-03-01 08:42, Christoph Hellwig wrote: Signed-off-by: Christoph Hellwig Moreso than the previous patch, where the feature is at least relatively generic (note that there's a bunch of in-flight development around DOMAIN_ATTR_NESTING), I'm really not convinced that it's beneficial to bloat the generic iommu_ops structure with private driver-specific interfaces. The attribute interface is a great compromise for these kinds of things, and you can easily add type-checked wrappers around it for external callers (maybe even make the actual attributes internal between the IOMMU core and drivers) if that's your concern. Robin. --- drivers/gpu/drm/msm/adreno/adreno_gpu.c | 2 +- drivers/iommu/arm/arm-smmu/arm-smmu.c | 40 +++-- drivers/iommu/iommu.c | 9 ++ include/linux/iommu.h | 9 +- 4 files changed, 29 insertions(+), 31 deletions(-) diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c index 0f184c3dd9d9ec..78d98ab2ee3a68 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c @@ -191,7 +191,7 @@ void adreno_set_llc_attributes(struct iommu_domain *iommu) struct io_pgtable_domain_attr pgtbl_cfg; pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_ARM_OUTER_WBWA; - iommu_domain_set_attr(iommu, DOMAIN_ATTR_IO_PGTABLE_CFG, _cfg); + iommu_domain_set_pgtable_attr(iommu, _cfg); } struct msm_gem_address_space * diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c index 2e17d990d04481..2858999c86dfd1 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c @@ -1515,40 +1515,22 @@ static int arm_smmu_domain_enable_nesting(struct iommu_domain *domain) return ret; } -static int arm_smmu_domain_set_attr(struct iommu_domain *domain, - enum iommu_attr attr, void *data) +static int arm_smmu_domain_set_pgtable_attr(struct iommu_domain *domain, + struct io_pgtable_domain_attr *pgtbl_cfg) { - int ret = 0; struct arm_smmu_domain *smmu_domain = 
to_smmu_domain(domain); + int ret = -EPERM; - mutex_lock(_domain->init_mutex); - - switch(domain->type) { - case IOMMU_DOMAIN_UNMANAGED: - switch (attr) { - case DOMAIN_ATTR_IO_PGTABLE_CFG: { - struct io_pgtable_domain_attr *pgtbl_cfg = data; - - if (smmu_domain->smmu) { - ret = -EPERM; - goto out_unlock; - } + if (domain->type != IOMMU_DOMAIN_UNMANAGED) + return -EINVAL; - smmu_domain->pgtbl_cfg = *pgtbl_cfg; - break; - } - default: - ret = -ENODEV; - } - break; - case IOMMU_DOMAIN_DMA: - ret = -ENODEV; - break; - default: - ret = -EINVAL; + mutex_lock(_domain->init_mutex); + if (!smmu_domain->smmu) { + smmu_domain->pgtbl_cfg = *pgtbl_cfg; + ret = 0; } -out_unlock: mutex_unlock(_domain->init_mutex); + return ret; } @@ -1609,7 +1591,7 @@ static struct iommu_ops arm_smmu_ops = { .device_group = arm_smmu_device_group, .dma_use_flush_queue= arm_smmu_dma_use_flush_queue, .dma_enable_flush_queue = arm_smmu_dma_enable_flush_queue, - .domain_set_attr= arm_smmu_domain_set_attr, + .domain_set_pgtable_attr = arm_smmu_domain_set_pgtable_attr, .domain_enable_nesting = arm_smmu_domain_enable_nesting, .of_xlate = arm_smmu_of_xlate, .get_resv_regions = arm_smmu_get_resv_regions, diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 2e9e058501a953..8490aefd4b41f8 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2693,6 +2693,15 @@ int iommu_domain_enable_nesting(struct iommu_domain *domain) } EXPORT_SYMBOL_GPL(iommu_domain_enable_nesting); +int iommu_domain_set_pgtable_attr(struct iommu_domain *domain, + struct io_pgtable_domain_attr *pgtbl_cfg) +{ + if (!domain->ops->domain_set_pgtable_attr) + return -EINVAL; + return domain->ops->domain_set_pgtable_attr(domain, pgtbl_cfg); +} +EXPORT_SYMBOL_GPL(iommu_domain_set_pgtable_attr); + void iommu_get_resv_regions(struct device *dev, struct list_head *list) { const struct iommu_ops *ops = dev->bus->iommu_ops; diff --git a/include/linux/iommu.h b/include/linux/iommu.h index aed88aa3bd3edf..39d3ed4d2700ac 100644 
--- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -40,6 +40,7 @@ struct iommu_domain; struct notifier_block; struct iommu_sva; struct iommu_fault_event; +struct io_pgtable_domain_attr;
Re: [PATCH 14/17] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-01 08:42, Christoph Hellwig wrote: Use explicit methods for setting and querying the information instead. Now that everyone's using iommu-dma, is there any point in bouncing this through the drivers at all? Seems like it would make more sense for the x86 drivers to reflect their private options back to iommu_dma_strict (and allow Intel's caching mode to override it as well), then have iommu_dma_init_domain just test !iommu_dma_strict && domain->ops->flush_iotlb_all. Robin. Also remove the now unused iommu_domain_get_attr functionality. Signed-off-by: Christoph Hellwig --- drivers/iommu/amd/iommu.c | 23 ++--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 47 ++--- drivers/iommu/arm/arm-smmu/arm-smmu.c | 56 + drivers/iommu/dma-iommu.c | 8 ++- drivers/iommu/intel/iommu.c | 27 ++ drivers/iommu/iommu.c | 19 +++ include/linux/iommu.h | 17 ++- 7 files changed, 51 insertions(+), 146 deletions(-) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index a69a8b573e40d0..37a8e51db17656 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -1771,24 +1771,11 @@ static struct iommu_group *amd_iommu_device_group(struct device *dev) return acpihid_device_group(dev); } -static int amd_iommu_domain_get_attr(struct iommu_domain *domain, - enum iommu_attr attr, void *data) +static bool amd_iommu_dma_use_flush_queue(struct iommu_domain *domain) { - switch (domain->type) { - case IOMMU_DOMAIN_UNMANAGED: - return -ENODEV; - case IOMMU_DOMAIN_DMA: - switch (attr) { - case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE: - *(int *)data = !amd_iommu_unmap_flush; - return 0; - default: - return -ENODEV; - } - break; - default: - return -EINVAL; - } + if (domain->type != IOMMU_DOMAIN_DMA) + return false; + return !amd_iommu_unmap_flush; } /* @@ -2257,7 +2244,7 @@ const struct iommu_ops amd_iommu_ops = { .release_device = amd_iommu_release_device, .probe_finalize = amd_iommu_probe_finalize, .device_group = amd_iommu_device_group, - .domain_get_attr = 
amd_iommu_domain_get_attr, + .dma_use_flush_queue = amd_iommu_dma_use_flush_queue, .get_resv_regions = amd_iommu_get_resv_regions, .put_resv_regions = generic_iommu_put_resv_regions, .is_attach_deferred = amd_iommu_is_attach_deferred, diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 8594b4a8304375..bf96172e8c1f71 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2449,33 +2449,21 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev) return group; } -static int arm_smmu_domain_get_attr(struct iommu_domain *domain, - enum iommu_attr attr, void *data) +static bool arm_smmu_dma_use_flush_queue(struct iommu_domain *domain) { struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - switch (domain->type) { - case IOMMU_DOMAIN_UNMANAGED: - switch (attr) { - case DOMAIN_ATTR_NESTING: - *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED); - return 0; - default: - return -ENODEV; - } - break; - case IOMMU_DOMAIN_DMA: - switch (attr) { - case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE: - *(int *)data = smmu_domain->non_strict; - return 0; - default: - return -ENODEV; - } - break; - default: - return -EINVAL; - } + if (domain->type != IOMMU_DOMAIN_DMA) + return false; + return smmu_domain->non_strict; +} + + +static void arm_smmu_dma_enable_flush_queue(struct iommu_domain *domain) +{ + if (domain->type != IOMMU_DOMAIN_DMA) + return; + to_smmu_domain(domain)->non_strict = true; } static int arm_smmu_domain_set_attr(struct iommu_domain *domain, @@ -2505,13 +2493,7 @@ static int arm_smmu_domain_set_attr(struct iommu_domain *domain, } break; case IOMMU_DOMAIN_DMA: - switch(attr) { - case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE: - smmu_domain->non_strict = *(int *)data; - break; - default: - ret =
Re: [PATCH 0/8] Convert the intel iommu driver to the dma-iommu api
Hi Tom, On 2019-12-21 15:03, Tom Murphy wrote: This patchset converts the intel iommu driver to the dma-iommu api. While converting the driver I exposed a bug in the intel i915 driver which causes a huge amount of artifacts on the screen of my laptop. You can see a picture of it here: https://github.com/pippy360/kernelPatches/blob/master/IMG_20191219_225922.jpg This issue is most likely in the i915 driver and is most likely caused by the driver not respecting the return value of the dma_map_ops::map_sg function. You can see the driver ignoring the return value here: https://github.com/torvalds/linux/blob/7e0165b2f1a912a06e381e91f0f4e495f4ac3736/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c#L51 Previously this didn’t cause issues because the intel map_sg always returned the same number of elements as the input scatter gather list but with the change to this dma-iommu api this is no longer the case. I wasn’t able to track the bug down to a specific line of code unfortunately. Could someone from the intel team look at this? I have been testing on a lenovo x1 carbon 5th generation. Let me know if there’s any more information you need. To allow my patch set to be tested I have added a patch (patch 8/8) in this series to disable combining sg segments in the dma-iommu api which fixes the bug but it doesn't fix the actual problem. As part of this patch series I copied the intel bounce buffer code to the dma-iommu path. The addition of the bounce buffer code took me by surprise. I did most of my development on this patch series before the bounce buffer code was added and my reimplementation in the dma-iommu path is very rushed and not properly tested but I’m running out of time to work on this patch set. On top of that I also didn’t port over the intel tracing code from this commit: https://github.com/torvalds/linux/commit/3b53034c268d550d9e8522e613a14ab53b8840d8#diff-6b3e7c4993f05e76331e463ab1fc87e1 So all the work in that commit is now wasted. 
The code will need to be removed and reimplemented in the dma-iommu path. I would like to take the time to do this but I really don’t have the time at the moment and I want to get these changes out before the iommu code changes any more. Further to what we just discussed at LPC, I've realised that tracepoints are actually something I could do with *right now* for debugging my Arm DMA ops series, so if I'm going to hack something up anyway I may as well take responsibility for polishing it into a proper patch as well :) Robin. Tom Murphy (8): iommu/vt-d: clean up 32bit si_domain assignment iommu/vt-d: Use default dma_direct_* mapping functions for direct mapped devices iommu/vt-d: Remove IOVA handling code from non-dma_ops path iommu: Handle freelists when using deferred flushing in iommu drivers iommu: Add iommu_dma_free_cpu_cached_iovas function iommu: allow the dma-iommu api to use bounce buffers iommu/vt-d: Convert intel iommu driver to the iommu ops DO NOT MERGE: iommu: disable list appending in dma-iommu drivers/iommu/Kconfig | 1 + drivers/iommu/amd_iommu.c | 14 +- drivers/iommu/arm-smmu-v3.c | 3 +- drivers/iommu/arm-smmu.c| 3 +- drivers/iommu/dma-iommu.c | 183 +-- drivers/iommu/exynos-iommu.c| 3 +- drivers/iommu/intel-iommu.c | 936 drivers/iommu/iommu.c | 39 +- drivers/iommu/ipmmu-vmsa.c | 3 +- drivers/iommu/msm_iommu.c | 3 +- drivers/iommu/mtk_iommu.c | 3 +- drivers/iommu/mtk_iommu_v1.c| 3 +- drivers/iommu/omap-iommu.c | 3 +- drivers/iommu/qcom_iommu.c | 3 +- drivers/iommu/rockchip-iommu.c | 3 +- drivers/iommu/s390-iommu.c | 3 +- drivers/iommu/tegra-gart.c | 3 +- drivers/iommu/tegra-smmu.c | 3 +- drivers/iommu/virtio-iommu.c| 3 +- drivers/vfio/vfio_iommu_type1.c | 2 +- include/linux/dma-iommu.h | 3 + include/linux/intel-iommu.h | 1 - include/linux/iommu.h | 32 +- 23 files changed, 345 insertions(+), 908 deletions(-)
Re: [PATCH V2 1/2] Add new flush_iotlb_range and handle freelists when using iommu_unmap_fast
On 2020-08-18 07:04, Tom Murphy wrote: Add a flush_iotlb_range to allow flushing of an iova range instead of a full flush in the dma-iommu path. Allow the iommu_unmap_fast to return newly freed page table pages and pass the freelist to queue_iova in the dma-iommu ops path. This patch is useful for iommu drivers (in this case the intel iommu driver) which need to wait for the ioTLB to be flushed before newly free/unmapped page table pages can be freed. This way we can still batch ioTLB free operations and handle the freelists. It sounds like the freelist is something that logically belongs in the iommu_iotlb_gather structure. And even if it's not a perfect fit I'd be inclined to jam it in there anyway just to avoid this giant argument explosion ;) Why exactly do we need to introduce a new flush_iotlb_range() op? Can't the AMD driver simply use the gather mechanism like everyone else? Robin. Change-log: V2: -fix missing parameter in mtk_iommu_v1.c Signed-off-by: Tom Murphy --- drivers/iommu/amd/iommu.c | 14 - drivers/iommu/arm-smmu-v3.c | 3 +- drivers/iommu/arm-smmu.c| 3 +- drivers/iommu/dma-iommu.c | 45 --- drivers/iommu/exynos-iommu.c| 3 +- drivers/iommu/intel/iommu.c | 54 + drivers/iommu/iommu.c | 25 +++ drivers/iommu/ipmmu-vmsa.c | 3 +- drivers/iommu/msm_iommu.c | 3 +- drivers/iommu/mtk_iommu.c | 3 +- drivers/iommu/mtk_iommu_v1.c| 3 +- drivers/iommu/omap-iommu.c | 3 +- drivers/iommu/qcom_iommu.c | 3 +- drivers/iommu/rockchip-iommu.c | 3 +- drivers/iommu/s390-iommu.c | 3 +- drivers/iommu/sun50i-iommu.c| 3 +- drivers/iommu/tegra-gart.c | 3 +- drivers/iommu/tegra-smmu.c | 3 +- drivers/iommu/virtio-iommu.c| 3 +- drivers/vfio/vfio_iommu_type1.c | 2 +- include/linux/iommu.h | 21 +++-- 21 files changed, 150 insertions(+), 56 deletions(-) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index 2f22326ee4df..25fbacab23c3 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -2513,7 +2513,8 @@ static int amd_iommu_map(struct 
iommu_domain *dom, unsigned long iova, static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova, size_t page_size, - struct iommu_iotlb_gather *gather) + struct iommu_iotlb_gather *gather, + struct page **freelist) { struct protection_domain *domain = to_pdomain(dom); struct domain_pgtable pgtable; @@ -2636,6 +2637,16 @@ static void amd_iommu_flush_iotlb_all(struct iommu_domain *domain) spin_unlock_irqrestore(>lock, flags); } +static void amd_iommu_flush_iotlb_range(struct iommu_domain *domain, + unsigned long iova, size_t size, + struct page *freelist) +{ + struct protection_domain *dom = to_pdomain(domain); + + domain_flush_pages(dom, iova, size); + domain_flush_complete(dom); +} + static void amd_iommu_iotlb_sync(struct iommu_domain *domain, struct iommu_iotlb_gather *gather) { @@ -2675,6 +2686,7 @@ const struct iommu_ops amd_iommu_ops = { .is_attach_deferred = amd_iommu_is_attach_deferred, .pgsize_bitmap = AMD_IOMMU_PGSIZES, .flush_iotlb_all = amd_iommu_flush_iotlb_all, + .flush_iotlb_range = amd_iommu_flush_iotlb_range, .iotlb_sync = amd_iommu_iotlb_sync, .def_domain_type = amd_iommu_def_domain_type, }; diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index f578677a5c41..8d328dc25326 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -2854,7 +2854,8 @@ static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova, } static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova, -size_t size, struct iommu_iotlb_gather *gather) +size_t size, struct iommu_iotlb_gather *gather, +struct page **freelist) { struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops; diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index 243bc4cb2705..0cd0dfc89875 100644 --- a/drivers/iommu/arm-smmu.c +++ b/drivers/iommu/arm-smmu.c @@ -1234,7 +1234,8 @@ static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova, 
} static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova, -size_t size, struct iommu_iotlb_gather *gather) +size_t size, struct iommu_iotlb_gather *gather, +struct page **freelist)
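Robin's suggestion above is to stash the freelist inside `struct iommu_iotlb_gather` rather than threading a new `struct page **freelist` argument through every driver's unmap path. A hypothetical user-space sketch of that idea, with simplified stand-ins for `struct page` and the gather structure (the real kernel types and the eventual mainline design differ; this only illustrates the "accumulate, flush, then free" ordering the thread is about):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct fake_page {
	struct fake_page *next;
};

struct iotlb_gather {
	unsigned long start, end;
	struct fake_page *freelist; /* pages to free once the IOTLB sync completes */
};

/* Called from unmap: defer freeing a no-longer-referenced page-table
 * page until after the IOTLB has been invalidated. */
static void gather_add_page(struct iotlb_gather *g, struct fake_page *p)
{
	p->next = g->freelist;
	g->freelist = p;
}

/* Called from iotlb_sync(): the flush has completed, so the pages can
 * no longer be reached via stale IOTLB entries and are safe to free. */
static size_t gather_release(struct iotlb_gather *g)
{
	size_t n = 0;

	while (g->freelist) {
		struct fake_page *p = g->freelist;

		g->freelist = p->next;
		free(p);
		n++;
	}
	return n;
}
```

Folding the list into the gather keeps the `unmap`/`iotlb_sync` signatures unchanged, which is exactly the "argument explosion" the review is pushing back on.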
Re: [PATCH v3 00/34] iommu: Move iommu_group setup to IOMMU core code
On 2020-07-01 01:40, Qian Cai wrote: Looks like this patchset introduced a use-after-free on arm-smmu-v3. Reproduced using mlx5, # echo 1 > /sys/class/net/enp11s0f1np1/device/sriov_numvfs # echo 0 > /sys/class/net/enp11s0f1np1/device/sriov_numvfs The .config, https://github.com/cailca/linux-mm/blob/master/arm64.config Looking at the free stack, iommu_release_device->iommu_group_remove_device was introduced in 07/34 ("iommu: Add probe_device() and release_device() call-backs"). Right, iommu_group_remove_device can tear down the group and call ->domain_free before the driver has any knowledge of the last device going away via the ->release_device call. I guess the question is do we simply flip the call order in iommu_release_device() so drivers can easily clean up their internal per-device state first, or do we now want them to be robust against freeing domains with devices still nominally attached? Robin.
Re: [RFC PATCH 17/34] iommu/arm-smmu: Store device instead of group in arm_smmu_s2cr
On 2020-04-08 3:37 pm, Joerg Roedel wrote: Hi Robin, thanks for looking into this. On Wed, Apr 08, 2020 at 01:09:40PM +0100, Robin Murphy wrote: For a hot-pluggable bus where logical devices may share Stream IDs (like fsl-mc), this could happen: create device A iommu_probe_device(A) iommu_device_group(A) -> alloc group X create device B iommu_probe_device(B) iommu_device_group(A) -> lookup returns group X ... iommu_remove_device(A) delete device A create device C iommu_probe_device(C) iommu_device_group(C) -> use-after-free of A Preserving the logical behaviour here would probably look *something* like the mangled diff below, but I haven't thought it through 100%. Yeah, I think you are right. How about just moving the loop which sets s2crs[idx].group to arm_smmu_device_group()? In that case I can drop this patch and leave the group pointer in place. Isn't that exactly what I suggested? :) I don't recall for sure, but knowing me, that bit of group bookkeeping is only where it currently is because it cheekily saves iterating the IDs a second time. I don't think there's any technical reason. Robin.
Re: [RFC PATCH 17/34] iommu/arm-smmu: Store device instead of group in arm_smmu_s2cr
On 2020-04-07 7:37 pm, Joerg Roedel wrote: From: Joerg Roedel This is required to convert the arm-smmu driver to the probe/release_device() interface. Signed-off-by: Joerg Roedel --- drivers/iommu/arm-smmu.c | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index a6a5796e9c41..3493501d8b2c 100644 --- a/drivers/iommu/arm-smmu.c +++ b/drivers/iommu/arm-smmu.c @@ -69,7 +69,7 @@ MODULE_PARM_DESC(disable_bypass, "Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU."); struct arm_smmu_s2cr { - struct iommu_group *group; + struct device *dev; int count; enum arm_smmu_s2cr_type type; enum arm_smmu_s2cr_privcfg privcfg; @@ -1100,7 +1100,7 @@ static int arm_smmu_master_alloc_smes(struct device *dev) /* It worked! Now, poke the actual hardware */ for_each_cfg_sme(cfg, fwspec, i, idx) { arm_smmu_write_sme(smmu, idx); - smmu->s2crs[idx].group = group; + smmu->s2crs[idx].dev = dev; } mutex_unlock(>stream_map_mutex); @@ -1495,11 +1495,15 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev) int i, idx; for_each_cfg_sme(cfg, fwspec, i, idx) { - if (group && smmu->s2crs[idx].group && - group != smmu->s2crs[idx].group) + struct iommu_group *idx_grp = NULL; + + if (smmu->s2crs[idx].dev) + idx_grp = smmu->s2crs[idx].dev->iommu_group; For a hot-pluggable bus where logical devices may share Stream IDs (like fsl-mc), this could happen: create device A iommu_probe_device(A) iommu_device_group(A) -> alloc group X create device B iommu_probe_device(B) iommu_device_group(A) -> lookup returns group X ... iommu_remove_device(A) delete device A create device C iommu_probe_device(C) iommu_device_group(C) -> use-after-free of A Preserving the logical behaviour here would probably look *something* like the mangled diff below, but I haven't thought it through 100%. 
Robin. ->8- diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index 16c4b87af42b..e88612ee47fe 100644 --- a/drivers/iommu/arm-smmu.c +++ b/drivers/iommu/arm-smmu.c @@ -1100,10 +1100,8 @@ static int arm_smmu_master_alloc_smes(struct device *dev) iommu_group_put(group); /* It worked! Now, poke the actual hardware */ - for_each_cfg_sme(fwspec, i, idx) { + for_each_cfg_sme(fwspec, i, idx) arm_smmu_write_sme(smmu, idx); - smmu->s2crs[idx].group = group; - } mutex_unlock(>stream_map_mutex); return 0; @@ -1500,15 +1498,17 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev) } if (group) - return iommu_group_ref_get(group); - - if (dev_is_pci(dev)) + iommu_group_ref_get(group); + else if (dev_is_pci(dev)) group = pci_device_group(dev); else if (dev_is_fsl_mc(dev)) group = fsl_mc_device_group(dev); else group = generic_device_group(dev); + for_each_cfg_sme(fwspec, i, idx) + smmu->s2crs[idx].group = group; + return group; }
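The pattern Robin's diff moves into arm_smmu_device_group() is: scan the device's stream indexes for an already-recorded group, reuse it if found (otherwise allocate a fresh one), then record the result back into *every* index so later devices sharing a Stream ID find a live group rather than a stale pointer. A hypothetical user-space sketch of that lookup-then-record shape, with the SMMU's conflict check (different indexes already holding different groups) and refcounting omitted:

```c
#include <assert.h>
#include <stddef.h>

#define NUM_S2CR 8

static void *s2cr_group[NUM_S2CR]; /* stand-in for smmu->s2crs[idx].group */
static long next_group_id = 1;

/* Stand-in for iommu_group_alloc(): hand out distinct tokens. */
static void *alloc_group(void)
{
	return (void *)next_group_id++;
}

void *device_group(const int *idxs, int n)
{
	void *group = NULL;
	int i;

	/* Reuse a group already recorded for any shared stream index. */
	for (i = 0; i < n; i++)
		if (s2cr_group[idxs[i]])
			group = s2cr_group[idxs[i]];

	if (!group)
		group = alloc_group();

	/* (Re)record it in every index so future lookups succeed even
	 * after the original device has gone away. */
	for (i = 0; i < n; i++)
		s2cr_group[idxs[i]] = group;

	return group;
}
```

Keeping the recording step inside device_group() (rather than alloc_smes()) is what avoids the use-after-free in the hotplug sequence above: the stored pointer is always the one just looked up or allocated.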
Re: [RFC PATCH v2] iommu/virtio: Use page size bitmap supported by endpoint
On 2020-04-01 12:38 pm, Bharat Bhushan wrote: Different endpoint can support different page size, probe endpoint if it supports specific page size otherwise use global page sizes. Signed-off-by: Bharat Bhushan --- drivers/iommu/virtio-iommu.c | 33 +++ include/uapi/linux/virtio_iommu.h | 7 +++ 2 files changed, 36 insertions(+), 4 deletions(-) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index cce329d71fba..c794cb5b7b3e 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -78,6 +78,7 @@ struct viommu_endpoint { struct viommu_dev *viommu; struct viommu_domain*vdomain; struct list_headresv_regions; + u64 pgsize_bitmap; }; struct viommu_request { @@ -415,6 +416,20 @@ static int viommu_replay_mappings(struct viommu_domain *vdomain) return ret; } +static int viommu_set_pgsize_bitmap(struct viommu_endpoint *vdev, + struct virtio_iommu_probe_pgsize_mask *mask, + size_t len) + +{ + u64 pgsize_bitmap = le64_to_cpu(mask->pgsize_bitmap); + + if (len < sizeof(*mask)) + return -EINVAL; + + vdev->pgsize_bitmap = pgsize_bitmap; + return 0; +} + static int viommu_add_resv_mem(struct viommu_endpoint *vdev, struct virtio_iommu_probe_resv_mem *mem, size_t len) @@ -494,11 +509,13 @@ static int viommu_probe_endpoint(struct viommu_dev *viommu, struct device *dev) while (type != VIRTIO_IOMMU_PROBE_T_NONE && cur < viommu->probe_size) { len = le16_to_cpu(prop->length) + sizeof(*prop); - switch (type) { case VIRTIO_IOMMU_PROBE_T_RESV_MEM: ret = viommu_add_resv_mem(vdev, (void *)prop, len); break; + case VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK: + ret = viommu_set_pgsize_bitmap(vdev, (void *)prop, len); + break; default: dev_err(dev, "unknown viommu prop 0x%x\n", type); } @@ -607,16 +624,23 @@ static struct iommu_domain *viommu_domain_alloc(unsigned type) return >domain; } -static int viommu_domain_finalise(struct viommu_dev *viommu, +static int viommu_domain_finalise(struct viommu_endpoint *vdev, struct iommu_domain *domain) { int ret; struct 
viommu_domain *vdomain = to_viommu_domain(domain); + struct viommu_dev *viommu = vdev->viommu; vdomain->viommu = viommu; vdomain->map_flags = viommu->map_flags; - domain->pgsize_bitmap = viommu->pgsize_bitmap; + /* Devices in same domain must support same size pages */ AFAICS what the code appears to do is enforce that the first endpoint attached to any domain has the same pgsize_bitmap as the most recently probed viommu_dev instance, then ignore any subsequent endpoints attached to the same domain. Thus I'm not sure that comment is accurate. Robin. + if ((domain->pgsize_bitmap != viommu->pgsize_bitmap) && + (domain->pgsize_bitmap != vdev->pgsize_bitmap)) + return -EINVAL; + + domain->pgsize_bitmap = vdev->pgsize_bitmap; + domain->geometry = viommu->geometry; ret = ida_alloc_range(>domain_ids, viommu->first_domain, @@ -657,7 +681,7 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev) * Properly initialize the domain now that we know which viommu * owns it. */ - ret = viommu_domain_finalise(vdev->viommu, domain); + ret = viommu_domain_finalise(vdev, domain); } else if (vdomain->viommu != vdev->viommu) { dev_err(dev, "cannot attach to foreign vIOMMU\n"); ret = -EXDEV; @@ -875,6 +899,7 @@ static int viommu_add_device(struct device *dev) vdev->dev = dev; vdev->viommu = viommu; + vdev->pgsize_bitmap = viommu->pgsize_bitmap; INIT_LIST_HEAD(>resv_regions); fwspec->iommu_priv = vdev; diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h index 237e36a280cb..dc9d3f40bcd8 100644 --- a/include/uapi/linux/virtio_iommu.h +++ b/include/uapi/linux/virtio_iommu.h @@ -111,6 +111,7 @@ struct virtio_iommu_req_unmap { #define VIRTIO_IOMMU_PROBE_T_NONE 0 #define VIRTIO_IOMMU_PROBE_T_RESV_MEM 1 +#define VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK2 #define VIRTIO_IOMMU_PROBE_T_MASK 0xfff @@ -119,6 +120,12 @@ struct virtio_iommu_probe_property { __le16 length; }; +struct virtio_iommu_probe_pgsize_mask { + struct virtio_iommu_probe_property 
head; +
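Robin's objection is that the check above only constrains the *first* endpoint attached to a domain (against either the global or the per-endpoint bitmap) and ignores later ones. If "devices in the same domain must support the same page sizes" is really the invariant, every attach would need validating against the bitmap the domain was finalised with. A hypothetical sketch of such a per-attach check, not code from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Return 0 if the endpoint can serve every page size the domain may
 * use, i.e. the domain's bitmap is a subset of the endpoint's;
 * otherwise fail the attach (-1 standing in for -EINVAL). */
static int domain_check_pgsize(uint64_t domain_bitmap, uint64_t endpoint_bitmap)
{
	return (domain_bitmap & ~endpoint_bitmap) == 0 ? 0 : -1;
}
```

With bit 12 representing 4KiB support and bit 21 representing 2MiB support, a domain finalised with only 4KiB pages can accept an endpoint that also supports 2MiB, but not the reverse.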
Re: [PATCH v2 3/3] iommu/virtio: Reject IOMMU page granule larger than PAGE_SIZE
On 2020-03-26 9:35 am, Jean-Philippe Brucker wrote: We don't currently support IOMMUs with a page granule larger than the system page size. The IOVA allocator has a BUG_ON() in this case, and VFIO has a WARN_ON(). Removing these obstacles doesn't seem possible without major changes to the DMA API and VFIO. Some callers of iommu_map(), for example, want to map multiple page-aligned regions adjacent to each other for scatter-gather purposes. Even in simple DMA API uses, a call to dma_map_page() would let the endpoint access neighbouring memory. And VFIO users cannot ensure that their virtual address buffer is physically contiguous at the IOMMU granule. Rather than triggering the IOVA BUG_ON() on mismatched page sizes, abort the vdomain finalise() with an error message. We could simply abort the viommu probe(), but an upcoming extension to virtio-iommu will allow setting different page masks for each endpoint. Reviewed-by: Robin Murphy Reported-by: Bharat Bhushan Signed-off-by: Jean-Philippe Brucker --- v1->v2: Move to vdomain_finalise(), improve commit message --- drivers/iommu/virtio-iommu.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index 5eed75cd121f..750f69c49b95 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -607,12 +607,22 @@ static struct iommu_domain *viommu_domain_alloc(unsigned type) return >domain; } -static int viommu_domain_finalise(struct viommu_dev *viommu, +static int viommu_domain_finalise(struct viommu_endpoint *vdev, struct iommu_domain *domain) { int ret; + unsigned long viommu_page_size; + struct viommu_dev *viommu = vdev->viommu; struct viommu_domain *vdomain = to_viommu_domain(domain); + viommu_page_size = 1UL << __ffs(viommu->pgsize_bitmap); + if (viommu_page_size > PAGE_SIZE) { + dev_err(vdev->dev, + "granule 0x%lx larger than system page size 0x%lx\n", + viommu_page_size, PAGE_SIZE); + return -EINVAL; + } + ret =
ida_alloc_range(&viommu->domain_ids, viommu->first_domain, viommu->last_domain, GFP_KERNEL); if (ret < 0) @@ -659,7 +669,7 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev) * Properly initialize the domain now that we know which viommu * owns it. */ - ret = viommu_domain_finalise(vdev->viommu, domain); + ret = viommu_domain_finalise(vdev, domain); } else if (vdomain->viommu != vdev->viommu) { dev_err(dev, "cannot attach to foreign vIOMMU\n"); ret = -EXDEV; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
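The check being added boils down to comparing the IOMMU's smallest granule (the lowest set bit of pgsize_bitmap) against the system page size. A standalone sketch of that logic, with a hypothetical 4 KiB constant standing in for the kernel's PAGE_SIZE and __builtin_ctzl() standing in for __ffs():

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical system page size standing in for the kernel's PAGE_SIZE. */
#define TOY_PAGE_SIZE (1UL << 12)

/*
 * The smallest granule an IOMMU supports is the lowest set bit of its
 * pgsize_bitmap. A domain can only be finalised if that granule fits
 * within a system page; otherwise a single IOMMU mapping would expose
 * more memory than the caller intended.
 */
static bool granule_fits(unsigned long pgsize_bitmap)
{
	unsigned long granule = 1UL << __builtin_ctzl(pgsize_bitmap);

	return granule <= TOY_PAGE_SIZE;
}
```

With these assumptions, a 4 KiB + 2 MiB bitmap passes, while a 64 KiB-only bitmap is rejected, which is exactly the mismatched-granule situation the patch guards against.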
Re: [PATCH v2 2/3] iommu/virtio: Fix freeing of incomplete domains
On 2020-03-26 9:35 am, Jean-Philippe Brucker wrote: Calling viommu_domain_free() on a domain that hasn't been finalised (not attached to any device, for example) can currently cause an Oops, because we attempt to call ida_free() on ID 0, which may either be unallocated or used by another domain. Only initialise the vdomain->viommu pointer, which denotes a finalised domain, at the end of a successful viommu_domain_finalise(). Reviewed-by: Robin Murphy Fixes: edcd69ab9a32 ("iommu: Add virtio-iommu driver") Reported-by: Eric Auger Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/virtio-iommu.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index cce329d71fba..5eed75cd121f 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -613,18 +613,20 @@ static int viommu_domain_finalise(struct viommu_dev *viommu, int ret; struct viommu_domain *vdomain = to_viommu_domain(domain); - vdomain->viommu = viommu; - vdomain->map_flags = viommu->map_flags; + ret = ida_alloc_range(&viommu->domain_ids, viommu->first_domain, + viommu->last_domain, GFP_KERNEL); + if (ret < 0) + return ret; + + vdomain->id = (unsigned int)ret; domain->pgsize_bitmap = viommu->pgsize_bitmap; domain->geometry = viommu->geometry; - ret = ida_alloc_range(&viommu->domain_ids, viommu->first_domain, - viommu->last_domain, GFP_KERNEL); - if (ret >= 0) - vdomain->id = (unsigned int)ret; + vdomain->map_flags = viommu->map_flags; + vdomain->viommu = viommu; - return ret > 0 ? 0 : ret; + return 0; } static void viommu_domain_free(struct iommu_domain *domain)
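The shape of the fix — allocate the ID first, publish the viommu pointer only once nothing can fail, and have free() use that pointer to distinguish finalised from unfinalised domains — can be modelled in isolation. This is toy code, not the driver's actual structures:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_IDS 8

static int id_in_use[TOY_MAX_IDS];

struct toy_domain {
	unsigned int id;  /* only meaningful once viommu is set */
	void *viommu;     /* non-NULL marks a finalised domain */
};

static int toy_finalise(struct toy_domain *d, void *viommu)
{
	unsigned int id;

	for (id = 0; id < TOY_MAX_IDS; id++)
		if (!id_in_use[id])
			break;
	if (id == TOY_MAX_IDS)
		return -1;

	id_in_use[id] = 1;
	d->id = id;
	d->viommu = viommu;  /* published last, mirroring the patch */
	return 0;
}

static void toy_free(struct toy_domain *d)
{
	/* An unfinalised domain owns no ID, so there is nothing to release. */
	if (d->viommu)
		id_in_use[d->id] = 0;
}
```

Before the fix, freeing a never-finalised domain would release ID 0 out from under whichever domain actually owned it; ordering the writes this way makes the free path safe.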
Re: [PATCH] iommu/virtio: Reject IOMMU page granule larger than PAGE_SIZE
On 2020-03-18 4:14 pm, Auger Eric wrote: Hi, On 3/18/20 1:00 PM, Robin Murphy wrote: On 2020-03-18 11:40 am, Jean-Philippe Brucker wrote: We don't currently support IOMMUs with a page granule larger than the system page size. The IOVA allocator has a BUG_ON() in this case, and VFIO has a WARN_ON(). Adding Alex in CC in case he has time to jump in. At the moment I don't get why this WARN_ON() is here. This was introduced in c8dbca165bb090f926996a572ea2b5b577b34b70 vfio/iommu_type1: Avoid overflow It might be possible to remove these obstacles if necessary. If the host uses 64kB pages and the guest uses 4kB, then a device driver calling alloc_page() followed by dma_map_page() will create a 64kB mapping for a 4kB physical page, allowing the endpoint to access the neighbouring 60kB of memory. This problem could be worked around with bounce buffers. FWIW the fundamental issue is that callers of iommu_map() may expect to be able to map two or more page-aligned regions directly adjacent to each other for scatter-gather purposes (or ring buffer tricks), and that's just not possible if the IOMMU granule is too big. Bounce buffering would be a viable workaround for the streaming DMA API and certain similar use-cases, but not in general (e.g. coherent DMA, VFIO, GPUs, etc.) Robin. For the moment, rather than triggering the IOVA BUG_ON() on mismatched page sizes, abort the virtio-iommu probe with an error message. I understand this is introduced as a temporary solution but this sounds like an important limitation to me. For instance this will prevent running a fedora guest exposed with a virtio-iommu with a RHEL host. As above, even if you bypassed all the warnings it wouldn't really work properly anyway.
In all cases that wouldn't be considered broken, the underlying hardware IOMMUs should support the same set of granules as the CPUs (or at least the smallest one), so is it actually appropriate for RHEL to (presumably) expose a 64K granule in the first place, rather than "works with anything" 4K? And/or more generally is there perhaps a hole in the virtio-iommu spec WRT being able to negotiate page_size_mask for a particular granule if multiple options are available? Robin. Thanks Eric Reported-by: Bharat Bhushan Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/virtio-iommu.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index 6d4e3c2a2ddb..80d5d8f621ab 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -998,6 +998,7 @@ static int viommu_probe(struct virtio_device *vdev) struct device *parent_dev = vdev->dev.parent; struct viommu_dev *viommu = NULL; struct device *dev = &vdev->dev; + unsigned long viommu_page_size; u64 input_start = 0; u64 input_end = -1UL; int ret; @@ -1028,6 +1029,14 @@ static int viommu_probe(struct virtio_device *vdev) goto err_free_vqs; } + viommu_page_size = 1UL << __ffs(viommu->pgsize_bitmap); + if (viommu_page_size > PAGE_SIZE) { + dev_err(dev, "granule 0x%lx larger than system page size 0x%lx\n", + viommu_page_size, PAGE_SIZE); + ret = -EINVAL; + goto err_free_vqs; + } + viommu->map_flags = VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE; viommu->last_domain = ~0U;
Re: [PATCH] iommu/virtio: Reject IOMMU page granule larger than PAGE_SIZE
On 2020-03-18 11:40 am, Jean-Philippe Brucker wrote: We don't currently support IOMMUs with a page granule larger than the system page size. The IOVA allocator has a BUG_ON() in this case, and VFIO has a WARN_ON(). It might be possible to remove these obstacles if necessary. If the host uses 64kB pages and the guest uses 4kB, then a device driver calling alloc_page() followed by dma_map_page() will create a 64kB mapping for a 4kB physical page, allowing the endpoint to access the neighbouring 60kB of memory. This problem could be worked around with bounce buffers. FWIW the fundamental issue is that callers of iommu_map() may expect to be able to map two or more page-aligned regions directly adjacent to each other for scatter-gather purposes (or ring buffer tricks), and that's just not possible if the IOMMU granule is too big. Bounce buffering would be a viable workaround for the streaming DMA API and certain similar use-cases, but not in general (e.g. coherent DMA, VFIO, GPUs, etc.) Robin. For the moment, rather than triggering the IOVA BUG_ON() on mismatched page sizes, abort the virtio-iommu probe with an error message. 
Reported-by: Bharat Bhushan Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/virtio-iommu.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index 6d4e3c2a2ddb..80d5d8f621ab 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -998,6 +998,7 @@ static int viommu_probe(struct virtio_device *vdev) struct device *parent_dev = vdev->dev.parent; struct viommu_dev *viommu = NULL; struct device *dev = &vdev->dev; + unsigned long viommu_page_size; u64 input_start = 0; u64 input_end = -1UL; int ret; @@ -1028,6 +1029,14 @@ static int viommu_probe(struct virtio_device *vdev) goto err_free_vqs; } + viommu_page_size = 1UL << __ffs(viommu->pgsize_bitmap); + if (viommu_page_size > PAGE_SIZE) { + dev_err(dev, "granule 0x%lx larger than system page size 0x%lx\n", + viommu_page_size, PAGE_SIZE); + ret = -EINVAL; + goto err_free_vqs; + } + viommu->map_flags = VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE; viommu->last_domain = ~0U;
Re: [PATCH 3/3] iommu/virtio: Enable x86 support
On 17/02/2020 1:31 pm, Michael S. Tsirkin wrote: On Mon, Feb 17, 2020 at 01:22:44PM +, Robin Murphy wrote: On 17/02/2020 1:01 pm, Michael S. Tsirkin wrote: On Mon, Feb 17, 2020 at 10:01:07AM +0100, Jean-Philippe Brucker wrote: On Sun, Feb 16, 2020 at 04:50:33AM -0500, Michael S. Tsirkin wrote: On Fri, Feb 14, 2020 at 04:57:11PM +, Robin Murphy wrote: On 14/02/2020 4:04 pm, Jean-Philippe Brucker wrote: With the built-in topology description in place, x86 platforms can now use the virtio-iommu. Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/Kconfig | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 068d4e0e3541..adcbda44d473 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -508,8 +508,9 @@ config HYPERV_IOMMU config VIRTIO_IOMMU bool "Virtio IOMMU driver" depends on VIRTIO=y - depends on ARM64 + depends on (ARM64 || X86) select IOMMU_API + select IOMMU_DMA Can that have an "if X86" for clarity? AIUI it's not necessary for virtio-iommu itself (and really shouldn't be), but is merely to satisfy the x86 arch code's expectation that IOMMU drivers bring their own DMA ops, right? Robin. In fact does not this work on any platform now? There is ongoing work to use the generic IOMMU_DMA ops on X86. AMD IOMMU has been converted recently [1] but VT-d still implements its own DMA ops (conversion patches are on the list [2]). On Arm the arch Kconfig selects IOMMU_DMA, and I assume we'll have the same on X86 once Tom's work is complete. Until then I can add a "if X86" here for clarity. Thanks, Jean [1] https://lore.kernel.org/linux-iommu/20190613223901.9523-1-murph...@tcd.ie/ [2] https://lore.kernel.org/linux-iommu/20191221150402.13868-1-murph...@tcd.ie/ What about others? E.g. PPC? That was the point I was getting at - while iommu-dma should build just fine for the likes of PPC, s390, 32-bit Arm, etc., they have no architecture code to correctly wire up iommu_dma_ops to devices. 
Thus there's currently no point pulling it in and pretending it's anything more than a waste of space for architectures other than arm64 and x86. It's merely a historical artefact of the x86 DMA API implementation that when the IOMMU drivers were split out to form drivers/iommu they took some of their relevant arch code with them. Robin. Rather than white-listing architectures, how about making the architectures in question set some kind of symbol, and depend on it? Umm, that's basically what we have already? Architectures that use iommu_dma_ops select IOMMU_DMA. The only issue is the oddity of x86 treating IOMMU drivers as part of its arch code, which has never come up against a cross-architecture driver until now. Hence the options of either maintaining that paradigm and having the 'x86 arch code' aspect of this driver "select IOMMU_DMA if x86" such that it works out equivalent to AMD_IOMMU, or a more involved cleanup to move that responsibility out of drivers/iommu/Kconfig entirely and have arch/x86/Kconfig do something like "select IOMMU_DMA if IOMMU_API", as Jean suggested up-thread. In the specific context of IOMMU_DMA we're not talking about any kind of white-list, merely a one-off special case for one particular architecture. Robin. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
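For concreteness, the "select IOMMU_DMA if X86" variant both sides seem able to live with would look roughly like this in drivers/iommu/Kconfig (a sketch of the change under discussion, not the final merged form):

```
config VIRTIO_IOMMU
	bool "Virtio IOMMU driver"
	depends on VIRTIO=y
	depends on (ARM64 || X86)
	select IOMMU_API
	select IOMMU_DMA if X86
	select INTERVAL_TREE
	help
	  Para-virtualised IOMMU driver with virtio.
```

This keeps the select self-documenting: arm64 already gets IOMMU_DMA from its arch Kconfig, so the conditional makes clear the driver only pulls it in to satisfy the x86 arch code's expectations.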
Re: [PATCH 3/3] iommu/virtio: Enable x86 support
On 17/02/2020 1:01 pm, Michael S. Tsirkin wrote: On Mon, Feb 17, 2020 at 10:01:07AM +0100, Jean-Philippe Brucker wrote: On Sun, Feb 16, 2020 at 04:50:33AM -0500, Michael S. Tsirkin wrote: On Fri, Feb 14, 2020 at 04:57:11PM +, Robin Murphy wrote: On 14/02/2020 4:04 pm, Jean-Philippe Brucker wrote: With the built-in topology description in place, x86 platforms can now use the virtio-iommu. Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/Kconfig | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 068d4e0e3541..adcbda44d473 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -508,8 +508,9 @@ config HYPERV_IOMMU config VIRTIO_IOMMU bool "Virtio IOMMU driver" depends on VIRTIO=y - depends on ARM64 + depends on (ARM64 || X86) select IOMMU_API + select IOMMU_DMA Can that have an "if X86" for clarity? AIUI it's not necessary for virtio-iommu itself (and really shouldn't be), but is merely to satisfy the x86 arch code's expectation that IOMMU drivers bring their own DMA ops, right? Robin. In fact does not this work on any platform now? There is ongoing work to use the generic IOMMU_DMA ops on X86. AMD IOMMU has been converted recently [1] but VT-d still implements its own DMA ops (conversion patches are on the list [2]). On Arm the arch Kconfig selects IOMMU_DMA, and I assume we'll have the same on X86 once Tom's work is complete. Until then I can add a "if X86" here for clarity. Thanks, Jean [1] https://lore.kernel.org/linux-iommu/20190613223901.9523-1-murph...@tcd.ie/ [2] https://lore.kernel.org/linux-iommu/20191221150402.13868-1-murph...@tcd.ie/ What about others? E.g. PPC? That was the point I was getting at - while iommu-dma should build just fine for the likes of PPC, s390, 32-bit Arm, etc., they have no architecture code to correctly wire up iommu_dma_ops to devices. 
Thus there's currently no point pulling it in and pretending it's anything more than a waste of space for architectures other than arm64 and x86. It's merely a historical artefact of the x86 DMA API implementation that when the IOMMU drivers were split out to form drivers/iommu they took some of their relevant arch code with them. Robin.
Re: [PATCH 2/3] PCI: Add DMA configuration for virtual platforms
On 14/02/2020 4:04 pm, Jean-Philippe Brucker wrote: Hardware platforms usually describe the IOMMU topology using either device-tree pointers or vendor-specific ACPI tables. For virtual platforms that don't provide a device-tree, the virtio-iommu device contains a description of the endpoints it manages. That information allows us to probe endpoints after the IOMMU is probed (possibly as late as userspace modprobe), provided it is discovered early enough. Add a hook to pci_dma_configure(), which returns -EPROBE_DEFER if the endpoint is managed by a vIOMMU that will be loaded later, or 0 in any other case to avoid disturbing the normal DMA configuration methods. When CONFIG_VIRTIO_IOMMU_TOPOLOGY isn't selected, the call to virt_dma_configure() is compiled out. As long as the information is consistent, platforms can provide both a device-tree and a built-in topology, and the IOMMU infrastructure is able to deal with multiple DMA configuration methods. Urgh, it's already been established[1] that having IOMMU setup tied to DMA configuration at driver probe time is not just conceptually wrong but actually broken, so the concept here worries me a bit. In a world where of_iommu_configure() and friends are being called much earlier around iommu_probe_device() time, how badly will this fall apart? Robin. 
[1] https://lore.kernel.org/linux-iommu/9625faf4-48ef-2dd3-d82f-931d9cf26...@huawei.com/ Signed-off-by: Jean-Philippe Brucker --- drivers/pci/pci-driver.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 0454ca0e4e3f..69303a814f21 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -18,6 +18,7 @@ #include #include #include +#include #include "pci.h" #include "pcie/portdrv.h" @@ -1602,6 +1603,10 @@ static int pci_dma_configure(struct device *dev) struct device *bridge; int ret = 0; + ret = virt_dma_configure(dev); + if (ret) + return ret; + bridge = pci_get_host_bridge_device(to_pci_dev(dev)); if (IS_ENABLED(CONFIG_OF) && bridge->parent && ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] iommu/virtio: Enable x86 support
On 14/02/2020 4:04 pm, Jean-Philippe Brucker wrote: With the built-in topology description in place, x86 platforms can now use the virtio-iommu. Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/Kconfig | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 068d4e0e3541..adcbda44d473 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -508,8 +508,9 @@ config HYPERV_IOMMU config VIRTIO_IOMMU bool "Virtio IOMMU driver" depends on VIRTIO=y - depends on ARM64 + depends on (ARM64 || X86) select IOMMU_API + select IOMMU_DMA Can that have an "if X86" for clarity? AIUI it's not necessary for virtio-iommu itself (and really shouldn't be), but is merely to satisfy the x86 arch code's expectation that IOMMU drivers bring their own DMA ops, right? Robin. select INTERVAL_TREE help Para-virtualised IOMMU driver with virtio. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/8] Convert the intel iommu driver to the dma-iommu api
On 2019-12-23 10:37 am, Jani Nikula wrote: On Sat, 21 Dec 2019, Tom Murphy wrote: This patchset converts the intel iommu driver to the dma-iommu api. While converting the driver I exposed a bug in the intel i915 driver which causes a huge amount of artifacts on the screen of my laptop. You can see a picture of it here: https://github.com/pippy360/kernelPatches/blob/master/IMG_20191219_225922.jpg This issue is most likely in the i915 driver and is most likely caused by the driver not respecting the return value of the dma_map_ops::map_sg function. You can see the driver ignoring the return value here: https://github.com/torvalds/linux/blob/7e0165b2f1a912a06e381e91f0f4e495f4ac3736/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c#L51 Previously this didn’t cause issues because the intel map_sg always returned the same number of elements as the input scatter gather list but with the change to this dma-iommu api this is no longer the case. I wasn’t able to track the bug down to a specific line of code unfortunately. Could someone from the intel team look at this? Let me get this straight. There is current API that on success always returns the same number of elements as the input scatter gather list. You propose to change the API so that this is no longer the case? No, the API for dma_map_sg() has always been that it may return fewer DMA segments than nents - see Documentation/DMA-API.txt (and otherwise, the return value would surely be a simple success/fail condition). Relying on a particular implementation behaviour has never been strictly correct, even if it does happen to be a very common behaviour. A quick check of various dma_map_sg() calls in the kernel seems to indicate checking for 0 for errors and then ignoring the non-zero return is a common pattern. Are you sure it's okay to make the change you're proposing? Various code uses tricks like just iterating the mapped list until the first segment with zero sg_dma_len(). Others may well simply have bugs. Robin. 
Anyway, due to the time of year and all, I'd like to ask you to file a bug against i915 at [1] so this is not forgotten, and please let's not merge the changes before this is resolved. Thanks, Jani. [1] https://gitlab.freedesktop.org/drm/intel/issues/new
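The contract Robin cites from Documentation/DMA-API.txt — that dma_map_sg() may return fewer DMA segments than the nents passed in, and callers must iterate over the returned count — can be illustrated with a toy mapper that coalesces physically adjacent pages (hypothetical names; this models only the coalescing behaviour, not real DMA):

```c
#include <assert.h>

#define TOY_PAGE 4096UL

struct toy_seg {
	unsigned long dma_addr;
	unsigned long dma_len;
};

/*
 * Toy stand-in for dma_map_sg(): adjacent input pages are merged into a
 * single DMA segment, so the returned count may be less than nents.
 * A caller that loops over the original nents entries, as the i915 path
 * did, walks into stale or zero-length segments.
 */
static int toy_map_sg(const unsigned long *page_addrs, int nents,
		      struct toy_seg *out)
{
	int i, mapped = 0;

	for (i = 0; i < nents; i++) {
		if (mapped && out[mapped - 1].dma_addr +
			      out[mapped - 1].dma_len == page_addrs[i]) {
			out[mapped - 1].dma_len += TOY_PAGE;
		} else {
			out[mapped].dma_addr = page_addrs[i];
			out[mapped].dma_len = TOY_PAGE;
			mapped++;
		}
	}
	return mapped; /* consumers must loop over this, not nents */
}
```

Three input pages where the first two are physically contiguous come back as two segments, which is legal per the API even though the old Intel implementation never did it.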
Re: [PATCH 1/2] dma-mapping: Add dma_addr_is_phys_addr()
On 14/10/2019 05:51, David Gibson wrote: On Fri, Oct 11, 2019 at 06:25:18PM -0700, Ram Pai wrote: From: Thiago Jung Bauermann In order to safely use the DMA API, virtio needs to know whether DMA addresses are in fact physical addresses and for that purpose, dma_addr_is_phys_addr() is introduced. cc: Benjamin Herrenschmidt cc: David Gibson cc: Michael Ellerman cc: Paul Mackerras cc: Michael Roth cc: Alexey Kardashevskiy cc: Paul Burton cc: Robin Murphy cc: Bartlomiej Zolnierkiewicz cc: Marek Szyprowski cc: Christoph Hellwig Suggested-by: Michael S. Tsirkin Signed-off-by: Ram Pai Signed-off-by: Thiago Jung Bauermann The change itself looks ok, so Reviewed-by: David Gibson However, I would like to see the commit message (and maybe the inline comments) expanded a bit on what the distinction here is about. Some of the text from the next patch would be suitable, about DMA addresses usually being in a different address space but not in the case of bounce buffering. Right, this needs a much tighter definition. "DMA address happens to be a valid physical address" is true of various IOMMU setups too, but I can't believe it's meaningful in such cases. If what you actually want is "DMA is direct or SWIOTLB" - i.e. "DMA address is physical address of DMA data (not necessarily the original buffer)" - wouldn't dma_is_direct() suffice? Robin. --- arch/powerpc/include/asm/dma-mapping.h | 21 + arch/powerpc/platforms/pseries/Kconfig | 1 + include/linux/dma-mapping.h| 20 kernel/dma/Kconfig | 3 +++ 4 files changed, 45 insertions(+) diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h index 565d6f7..f92c0a4b 100644 --- a/arch/powerpc/include/asm/dma-mapping.h +++ b/arch/powerpc/include/asm/dma-mapping.h @@ -5,6 +5,8 @@ #ifndef _ASM_DMA_MAPPING_H #define _ASM_DMA_MAPPING_H +#include + static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { /* We don't handle the NULL dev case for ISA for now. 
We could @@ -15,4 +17,23 @@ static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) return NULL; } +#ifdef CONFIG_ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR +/** + * dma_addr_is_phys_addr - check whether a device DMA address is a physical + * address + * @dev: device to check + * + * Returns %true if any DMA address for this device happens to also be a valid + * physical address (not necessarily of the same page). + */ +static inline bool dma_addr_is_phys_addr(struct device *dev) +{ + /* +* Secure guests always use the SWIOTLB, therefore DMA addresses are +* actually the physical address of the bounce buffer. +*/ + return is_secure_guest(); +} +#endif + #endif/* _ASM_DMA_MAPPING_H */ diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index 9e35cdd..0108150 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -152,6 +152,7 @@ config PPC_SVM select SWIOTLB select ARCH_HAS_MEM_ENCRYPT select ARCH_HAS_FORCE_DMA_UNENCRYPTED + select ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR help There are certain POWER platforms which support secure guests using the Protected Execution Facility, with the help of an Ultravisor diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index f7d1eea..6df5664 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -693,6 +693,26 @@ static inline bool dma_addressing_limited(struct device *dev) dma_get_required_mask(dev); } +#ifndef CONFIG_ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR +/** + * dma_addr_is_phys_addr - check whether a device DMA address is a physical + * address + * @dev: device to check + * + * Returns %true if any DMA address for this device happens to also be a valid + * physical address (not necessarily of the same page). 
+ */ +static inline bool dma_addr_is_phys_addr(struct device *dev) +{ + /* +* Except in very specific setups, DMA addresses exist in a different +* address space from CPU physical addresses and cannot be directly used +* to reference system memory. +*/ + return false; +} +#endif + #ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, const struct iommu_ops *iommu, bool coherent); diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index 9decbba..6209b46 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -51,6 +51,9 @@ config ARCH_HAS_DMA_MMAP_PGPROT config ARCH_HAS_FORCE_DMA_UNENCRYPTED bool +config ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR + bool + config DMA_NONCO
Re: [PATCH V5 4/5] iommu/dma-iommu: Use the dev->coherent_dma_mask
On 15/08/2019 12:09, Tom Murphy wrote: Use the dev->coherent_dma_mask when allocating in the dma-iommu ops api. Oops... I suppose technically that's my latent bug, but since we've all missed it so far, I doubt arm64 systems ever see any devices which actually have different masks. Reviewed-by: Robin Murphy Signed-off-by: Tom Murphy --- drivers/iommu/dma-iommu.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 906b7fa14d3c..b9a3ab02434b 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -471,7 +471,7 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr, } static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys, - size_t size, int prot) + size_t size, int prot, dma_addr_t dma_mask) { struct iommu_domain *domain = iommu_get_dma_domain(dev); struct iommu_dma_cookie *cookie = domain->iova_cookie; @@ -484,7 +484,7 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys, size = iova_align(iovad, size + iova_off); - iova = iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev); + iova = iommu_dma_alloc_iova(domain, size, dma_mask, dev); if (!iova) return DMA_MAPPING_ERROR; @@ -735,7 +735,7 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page, int prot = dma_info_to_prot(dir, coherent, attrs); dma_addr_t dma_handle; - dma_handle = __iommu_dma_map(dev, phys, size, prot); + dma_handle = __iommu_dma_map(dev, phys, size, prot, dma_get_mask(dev)); if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC) && dma_handle != DMA_MAPPING_ERROR) arch_sync_dma_for_device(dev, phys, size, dir); @@ -938,7 +938,8 @@ static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys, size_t size, enum dma_data_direction dir, unsigned long attrs) { return __iommu_dma_map(dev, phys, size, - dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO); + dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO, + 
dma_get_mask(dev)); } static void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle, @@ -1041,7 +1042,8 @@ static void *iommu_dma_alloc(struct device *dev, size_t size, if (!cpu_addr) return NULL; - *handle = __iommu_dma_map(dev, page_to_phys(page), size, ioprot); + *handle = __iommu_dma_map(dev, page_to_phys(page), size, ioprot, + dev->coherent_dma_mask); if (*handle == DMA_MAPPING_ERROR) { __iommu_dma_free(dev, size, cpu_addr); return NULL; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V5 3/5] iommu/dma-iommu: Handle deferred devices
On 15/08/2019 12:09, Tom Murphy wrote: Handle devices which defer their attach to the iommu in the dma-iommu api Other than nitpicking the name (I'd lean towards something like iommu_dma_deferred_attach), Reviewed-by: Robin Murphy Signed-off-by: Tom Murphy --- drivers/iommu/dma-iommu.c | 27 ++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 2712fbc68b28..906b7fa14d3c 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -22,6 +22,7 @@ #include #include #include +#include struct iommu_dma_msi_page { struct list_headlist; @@ -351,6 +352,21 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, return iova_reserve_iommu_regions(dev, domain); } +static int handle_deferred_device(struct device *dev, + struct iommu_domain *domain) +{ + const struct iommu_ops *ops = domain->ops; + + if (!is_kdump_kernel()) + return 0; + + if (unlikely(ops->is_attach_deferred && + ops->is_attach_deferred(domain, dev))) + return iommu_attach_device(domain, dev); + + return 0; +} + /** * dma_info_to_prot - Translate DMA API directions and attributes to IOMMU API *page flags. 
@@ -463,6 +479,9 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys, size_t iova_off = iova_offset(iovad, phys); dma_addr_t iova; + if (unlikely(handle_deferred_device(dev, domain))) + return DMA_MAPPING_ERROR; + size = iova_align(iovad, size + iova_off); iova = iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev); @@ -581,6 +600,9 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size, *dma_handle = DMA_MAPPING_ERROR; + if (unlikely(handle_deferred_device(dev, domain))) + return NULL; + min_size = alloc_sizes & -alloc_sizes; if (min_size < PAGE_SIZE) { min_size = PAGE_SIZE; @@ -713,7 +735,7 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page, int prot = dma_info_to_prot(dir, coherent, attrs); dma_addr_t dma_handle; - dma_handle =__iommu_dma_map(dev, phys, size, prot); + dma_handle = __iommu_dma_map(dev, phys, size, prot); if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC) && dma_handle != DMA_MAPPING_ERROR) arch_sync_dma_for_device(dev, phys, size, dir); @@ -823,6 +845,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, unsigned long mask = dma_get_seg_boundary(dev); int i; + if (unlikely(handle_deferred_device(dev, domain))) + return 0; + if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) iommu_dma_sync_sg_for_device(dev, sg, nents, dir); ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V5 2/5] iommu: Add gfp parameter to iommu_ops::map
On 15/08/2019 12:09, Tom Murphy wrote: Add a gfp_t parameter to the iommu_ops::map function. Remove the needless locking in the AMD iommu driver. The iommu_ops::map function (or the iommu_map function which calls it) was always supposed to be sleepable (according to Joerg's comment in this thread: https://lore.kernel.org/patchwork/patch/977520/ ) and so should probably have had a "might_sleep()" since it was written. However currently the dma-iommu api can call iommu_map in an atomic context, which it shouldn't do. This doesn't cause any problems because any iommu driver which uses the dma-iommu api uses gfp_atomic in it's iommu_ops::map function. But doing this wastes the memory allocators atomic pools. Looks reasonable to me - once we get the merges sorted out I'll take a look at propagating the flags through to io-pgtable for the SMMU drivers and friends. Reviewed-by: Robin Murphy Signed-off-by: Tom Murphy --- drivers/iommu/amd_iommu.c | 3 ++- drivers/iommu/arm-smmu-v3.c| 2 +- drivers/iommu/arm-smmu.c | 2 +- drivers/iommu/dma-iommu.c | 6 ++--- drivers/iommu/exynos-iommu.c | 2 +- drivers/iommu/intel-iommu.c| 2 +- drivers/iommu/iommu.c | 43 +- drivers/iommu/ipmmu-vmsa.c | 2 +- drivers/iommu/msm_iommu.c | 2 +- drivers/iommu/mtk_iommu.c | 2 +- drivers/iommu/mtk_iommu_v1.c | 2 +- drivers/iommu/omap-iommu.c | 2 +- drivers/iommu/qcom_iommu.c | 2 +- drivers/iommu/rockchip-iommu.c | 2 +- drivers/iommu/s390-iommu.c | 2 +- drivers/iommu/tegra-gart.c | 2 +- drivers/iommu/tegra-smmu.c | 2 +- drivers/iommu/virtio-iommu.c | 2 +- include/linux/iommu.h | 21 - 19 files changed, 77 insertions(+), 26 deletions(-) diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c index 1948be7ac8f8..0e53f9bd2be7 100644 --- a/drivers/iommu/amd_iommu.c +++ b/drivers/iommu/amd_iommu.c @@ -3030,7 +3030,8 @@ static int amd_iommu_attach_device(struct iommu_domain *dom, } static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova, -phys_addr_t paddr, size_t page_size, int 
iommu_prot) +phys_addr_t paddr, size_t page_size, int iommu_prot, +gfp_t gfp) { struct protection_domain *domain = to_pdomain(dom); int prot = 0; diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index e7f49fd1a7ba..acc0eae7963f 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -1975,7 +1975,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev) } static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova, - phys_addr_t paddr, size_t size, int prot) + phys_addr_t paddr, size_t size, int prot, gfp_t gfp) { struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops; diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index aa06498f291d..05f42bdee494 100644 --- a/drivers/iommu/arm-smmu.c +++ b/drivers/iommu/arm-smmu.c @@ -1284,7 +1284,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev) } static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova, - phys_addr_t paddr, size_t size, int prot) + phys_addr_t paddr, size_t size, int prot, gfp_t gfp) { struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops; struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu; diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d991d40f797f..2712fbc68b28 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -469,7 +469,7 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys, if (!iova) return DMA_MAPPING_ERROR; - if (iommu_map(domain, iova, phys - iova_off, size, prot)) { + if (iommu_map_atomic(domain, iova, phys - iova_off, size, prot)) { iommu_dma_free_iova(cookie, iova, size); return DMA_MAPPING_ERROR; } @@ -613,7 +613,7 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size, arch_dma_prep_coherent(sg_page(sg), sg->length); } - if (iommu_map_sg(domain, iova, sgt.sgl, sgt.orig_nents, ioprot) + if (iommu_map_sg_atomic(domain, iova, sgt.sgl, 
sgt.orig_nents, ioprot) < size) goto out_free_sg; @@ -873,7 +873,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, * We'll leave any physical concatenation to the IOMMU driver's * implementation - it knows better than we do. */ - if (iommu_map_sg(domain, iova, sg, nents, prot) < iova_len) + if (iommu_map_sg_atomic(d
Re: [PATCH 2/2] virtio/virtio_ring: Fix the dma_max_mapping_size call
On 22/07/2019 15:55, Eric Auger wrote: Do not call dma_max_mapping_size for devices that have no DMA mask set, otherwise we can hit a NULL pointer dereference. This occurs when a virtio-blk-pci device is protected with a virtual IOMMU. Fixes: e6d6dd6c875e ("virtio: Introduce virtio_max_dma_size()") Signed-off-by: Eric Auger Suggested-by: Christoph Hellwig --- drivers/virtio/virtio_ring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index c8be1c4f5b55..37c143971211 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -262,7 +262,7 @@ size_t virtio_max_dma_size(struct virtio_device *vdev) { size_t max_segment_size = SIZE_MAX; - if (vring_use_dma_api(vdev)) + if (vring_use_dma_api(vdev) && vdev->dev.dma_mask) Hmm, might it make sense to roll that check up into vring_use_dma_api() itself? After all, if the device has no mask then it's likely that other DMA API ops wouldn't really work as expected either. Robin. max_segment_size = dma_max_mapping_size(&vdev->dev); return max_segment_size; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/3] Fix virtio-blk issue with SWIOTLB
On 14/01/2019 18:20, Michael S. Tsirkin wrote: On Mon, Jan 14, 2019 at 08:41:37PM +0800, Jason Wang wrote: On 2019/1/14 5:50 PM, Christoph Hellwig wrote: On Mon, Jan 14, 2019 at 05:41:56PM +0800, Jason Wang wrote: On 2019/1/11 5:15 PM, Joerg Roedel wrote: On Fri, Jan 11, 2019 at 11:29:31AM +0800, Jason Wang wrote: Just wondering if my understanding is correct: IOMMU_PLATFORM must be set for all virtio devices under AMD-SEV guests? Yes, that is correct. Emulated DMA can only happen on the SWIOTLB aperture, because that memory is not encrypted. The guest then bounces the data to its encrypted memory. Regards, Joerg Thanks, have you tested vhost-net in this case? I suspect it may not work. Which brings me back to my pet peeve that we need to take action so that virtio uses the proper DMA mapping API by default, with quirks for legacy cases. The magic bypass it uses is just causing problems over problems. Yes, I fully agree with you. This is probably an exact example of such a problem. Thanks I don't think so - the issue is really that the DMA API does not yet handle the SEV case 100% correctly. I suspect passthrough devices would have the same issue. Huh? Regardless of which virtio devices use it or not, the DMA API is handling the SEV case as correctly as it possibly can, by forcing everything through the unencrypted bounce buffer. If the segments being mapped are too big for that bounce buffer in the first place, there's nothing it can possibly do except fail, gracefully or otherwise. Now, in theory, yes, the real issue at hand is not unique to virtio-blk nor SEV - any driver whose device has a sufficiently large DMA segment size and who manages to get sufficient physically-contiguous memory could technically generate a scatterlist segment longer than SWIOTLB can handle. However, in practice that basically never happens, not least because very few drivers ever override the default 64K DMA segment limit. 
AFAICS nothing in drivers/virtio is calling dma_set_max_seg_size() or otherwise assigning any dma_parms to replace the defaults either, so the really interesting question here is how are these apparently-out-of-spec 256K segments getting generated at all? Robin. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [virtio-dev] Re: [PATCH v5 5/7] iommu: Add virtio-iommu driver
On 2018-12-12 3:27 pm, Auger Eric wrote: Hi, On 12/12/18 3:56 PM, Michael S. Tsirkin wrote: On Fri, Dec 07, 2018 at 06:52:31PM +, Jean-Philippe Brucker wrote: Sorry for the delay, I wanted to do a little more performance analysis before continuing. On 27/11/2018 18:10, Michael S. Tsirkin wrote: On Tue, Nov 27, 2018 at 05:55:20PM +, Jean-Philippe Brucker wrote: + if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1) || + !virtio_has_feature(vdev, VIRTIO_IOMMU_F_MAP_UNMAP)) Why bother with a feature bit for this then btw? We'll need a new feature bit for sharing page tables with the hardware, because they require different requests (attach_table/invalidate instead of map/unmap.) A future device supporting page table sharing won't necessarily need to support map/unmap. I don't see virtio iommu being extended to support ARM specific requests. This just won't scale, too many different descriptor formats out there. They aren't really ARM specific requests. The two new requests are ATTACH_TABLE and INVALIDATE, which would be used by x86 IOMMUs as well. Sharing CPU address space with the HW IOMMU (SVM) has been in the scope of virtio-iommu since the first RFC, and I've been working with that extension in mind since the beginning. As an example you can have a look at my current draft for this [1], which is inspired by the VFIO work we've been doing with Intel. The negotiation phase inevitably requires vendor-specific fields in the descriptors - the host tells which formats are supported, the guest chooses a format and attaches page tables. But invalidation and fault reporting descriptors are fairly generic. We need to tread carefully here. People expect that if a user does lspci and sees a virtio device then it's reasonably portable. If you want to go that way down the road, you should avoid virtio iommu, instead emulate and share code with the ARM SMMU (probably with a different vendor id so you can implement the report on map for devices without PRI). 
vSMMU has to stay in userspace though. The main reason we're proposing virtio-iommu is that emulating every possible vIOMMU model in the kernel would be unmaintainable. With virtio-iommu we can process the fast path in the host kernel, through vhost-iommu, and do the heavy lifting in userspace. Interesting. As said above, I'm trying to keep the fast path for virtio-iommu generic. More notes on what I consider to be the fast path, and comparison with vSMMU: (1) The primary use-case we have in mind for vIOMMU is something like DPDK in the guest, assigning a hardware device to guest userspace. DPDK maps a large amount of memory statically, to be used by a pass-through device. For this case I don't think we care about vIOMMU performance. Setup and teardown need to be reasonably fast, sure, but the MAP/UNMAP requests don't have to be optimal. (2) If the assigned device is owned by the guest kernel, then mappings are dynamic and require dma_map/unmap() to be fast, but there generally is no need for a vIOMMU, since device and drivers are trusted by the guest kernel. Even when the user does enable a vIOMMU for this case (allowing to over-commit guest memory, which needs to be pinned otherwise), BTW that's in theory; in practice it doesn't really work. we generally play tricks like lazy TLBI (non-strict mode) to make it faster. Simple lazy TLB for guest/userspace drivers would be a big no no. You need something smarter. Here device and drivers are trusted, therefore the vulnerability window of lazy mode isn't a concern. If the reason to enable the vIOMMU is over-committing guest memory however, you can't use nested translation because it requires pinning the second-level tables. For this case performance matters a bit, because your invalidate-on-map needs to be fast, even if you enable lazy mode and only receive inval-on-unmap every 10ms. It won't ever be as fast as nested translation, though. 
For this case I think vSMMU+Caching Mode and userspace virtio-iommu with MAP/UNMAP would perform similarly (given page-sized payloads), because the pagetable walk doesn't add a lot of overhead compared to the context switch. But given the results below, vhost-iommu would be faster than vSMMU+CM. (3) Then there is SVM. For SVM, any destructive change to the process address space requires a synchronous invalidation command to the hardware (at least when using PCI ATS). Given that SVM is based on page faults, fault reporting from host to guest also needs to be fast, as well as fault response from guest to host. I think this is where performance matters the most. To get a feel of the advantage we get with virtio-iommu, I compared the vSMMU page-table sharing implementation [2] and vhost-iommu + VFIO with page table sharing (based on Tomasz Nowicki's vhost-iommu prototype). That's on a ThunderX2 with a 10Gb NIC assigned to the guest kernel, which corresponds to case (2) above, with nesting page tables and without the lazy mode. The
Re: [PATCH v3 3/7] PCI: OF: Allow endpoints to bypass the iommu
On 17/10/18 16:14, Michael S. Tsirkin wrote: On Mon, Oct 15, 2018 at 08:46:41PM +0100, Jean-philippe Brucker wrote: [Replying with my personal address because we're having SMTP issues] On 15/10/2018 11:52, Michael S. Tsirkin wrote: On Fri, Oct 12, 2018 at 02:41:59PM -0500, Bjorn Helgaas wrote: s/iommu/IOMMU/ in subject On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote: Using the iommu-map binding, endpoints in a given PCI domain can be managed by different IOMMUs. Some virtual machines may allow a subset of endpoints to bypass the IOMMU. In some case the IOMMU itself is presented s/case/cases/ as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a PCI root complex has an iommu-map property, the driver requires all endpoints to be described by the property. Allow the iommu-map property to have gaps. I'm not an IOMMU or virtio expert, so it's not obvious to me why it is safe to allow devices to bypass the IOMMU. Does this mean a typo in iommu-map could inadvertently allow devices to bypass it? Thinking about this comment, I would like to ask: can't the virtio device indicate the ranges in a portable way? This would minimize the dependency on dt bindings and ACPI, enabling support for systems that have neither but do have virtio e.g. through pci. I thought about adding a PROBE request for this in virtio-iommu, but it wouldn't be usable by a Linux guest because of a bootstrapping problem. Hmm. At some level it seems wrong to design hardware interfaces around how Linux happens to probe things. That can change at any time ... This isn't Linux-specific though. In general it's somewhere between difficult and impossible to pull in an IOMMU underneath a device after a device is active, so if any OS wants to use an IOMMU, it's going to want to know up-front that it's there and which devices it translates so that it can program said IOMMU appropriately *before* potentially starting DMA and/or interrupts from the relevant devices. 
Linux happens to do things in that order (either by firmware-driven probe-deferral or just perilous initcall ordering) because it is the only reasonable order in which to do them. AFAIK the platforms which don't rely on any firmware description of their IOMMU tend to have a fairly static system architecture (such that the OS simply makes hard-coded assumptions), so it's not necessarily entirely clear how they would cope with virtio-iommu either way. Robin. Early on, Linux needs a description of device dependencies, to determine in which order to probe them. If the device dependency was described by virtio-iommu itself, the guest could for example initialize a NIC, allocate buffers and start DMA on the physical address space (which aborts if the IOMMU implementation disallows DMA by default), only to find out once the virtio-iommu module is loaded that it needs to cancel all DMA and reconfigure the NIC. With a static description such as iommu-map in DT or ACPI remapping tables, the guest can defer probing of the NIC until the IOMMU is initialized. Thanks, Jean Could you point me at the code you refer to here? ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 3/7] PCI: OF: Allow endpoints to bypass the iommu
On 12/10/18 20:41, Bjorn Helgaas wrote: s/iommu/IOMMU/ in subject On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote: Using the iommu-map binding, endpoints in a given PCI domain can be managed by different IOMMUs. Some virtual machines may allow a subset of endpoints to bypass the IOMMU. In some case the IOMMU itself is presented s/case/cases/ as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a PCI root complex has an iommu-map property, the driver requires all endpoints to be described by the property. Allow the iommu-map property to have gaps. I'm not an IOMMU or virtio expert, so it's not obvious to me why it is safe to allow devices to bypass the IOMMU. Does this mean a typo in iommu-map could inadvertently allow devices to bypass it? Should we indicate something in dmesg (and/or sysfs) about devices that bypass it? It's not really "allow devices to bypass the IOMMU" so much as "allow DT to describe devices which the IOMMU doesn't translate". It's a bit of an edge case for not-really-PCI devices, but FWIW I can certainly think of several ways to build real hardware like that. As for inadvertent errors leaving out IDs which *should* be in the map, that really depends on the IOMMU/driver implementation - e.g. SMMUv2 with arm-smmu.disable_bypass=0 would treat the device as untranslated, whereas SMMUv3 would always generate a fault upon any transaction due to no valid stream table entry being programmed (not even a bypass one). I reckon it's a sufficiently unusual case that keeping some sort of message probably is worthwhile (at pr_info rather than pr_err) in case someone does hit it by mistake. Relaxing of_pci_map_rid also allows the msi-map property to have gaps, At worst, I suppose we could always add yet another parameter for each caller to choose whether a missing entry is considered an error or not. Robin. s/of_pci_map_rid/of_pci_map_rid()/ which is invalid since MSIs always reach an MSI controller. 
Thankfully Linux will error out later, when attempting to find an MSI domain for the device. Not clear to me what "error out" means here. In a userspace program, I would infer that the program exits with an error message, but I doubt you mean that Linux exits. Signed-off-by: Jean-Philippe Brucker --- drivers/pci/of.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/pci/of.c b/drivers/pci/of.c index 1836b8ddf292..2f5015bdb256 100644 --- a/drivers/pci/of.c +++ b/drivers/pci/of.c @@ -451,9 +451,10 @@ int of_pci_map_rid(struct device_node *np, u32 rid, return 0; } - pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n", - np, map_name, rid, target && *target ? *target : NULL); - return -EFAULT; + /* Bypasses translation */ + if (id_out) + *id_out = rid; + return 0; } #if IS_ENABLED(CONFIG_OF_IRQ) -- 2.19.1 ___ iommu mailing list io...@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 1/5] dt-bindings: virtio: Specify #iommu-cells value for a virtio-iommu
On 27/06/18 18:46, Rob Herring wrote: On Tue, Jun 26, 2018 at 11:59 AM Jean-Philippe Brucker wrote: On 25/06/18 20:27, Rob Herring wrote: On Thu, Jun 21, 2018 at 08:06:51PM +0100, Jean-Philippe Brucker wrote: A virtio-mmio node may represent a virtio-iommu device. This is discovered by the virtio driver at probe time, but the DMA topology isn't discoverable and must be described by firmware. For DT the standard IOMMU description is used, as specified in bindings/iommu/iommu.txt and bindings/pci/pci-iommu.txt. Like many other IOMMUs, virtio-iommu distinguishes masters by their endpoint IDs, which requires one IOMMU cell in the "iommus" property. Signed-off-by: Jean-Philippe Brucker --- Documentation/devicetree/bindings/virtio/mmio.txt | 8 1 file changed, 8 insertions(+) diff --git a/Documentation/devicetree/bindings/virtio/mmio.txt b/Documentation/devicetree/bindings/virtio/mmio.txt index 5069c1b8e193..337da0e3a87f 100644 --- a/Documentation/devicetree/bindings/virtio/mmio.txt +++ b/Documentation/devicetree/bindings/virtio/mmio.txt @@ -8,6 +8,14 @@ Required properties: - reg: control registers base address and size including configuration space - interrupts: interrupt generated by the device +Required properties for virtio-iommu: + +- #iommu-cells: When the node describes a virtio-iommu device, it is +linked to DMA masters using the "iommus" property as +described in devicetree/bindings/iommu/iommu.txt. For +virtio-iommu #iommu-cells must be 1, each cell describing +a single endpoint ID. The iommus property should also be documented for the client side. Isn't section "IOMMU master node" of iommu.txt sufficient? Since the iommus property applies to any DMA master, not only virtio-mmio devices, the canonical description in iommu.txt seems the best place for it, and I'm not sure what to add in this file. Maybe a short example below the virtio_block one? No, because somewhere we have to capture if 'iommus' is valid for 'virtio-mmio' or not. 
Hopefully soon we'll actually be able to validate that. Indeed, it's rather unusual to have a single compatible which may either be an IOMMU or an IOMMU client (but not both at once, I hope!), so nailing down the exact semantics as clearly as possible would definitely be desirable. Robin. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] iommu/virtio: Add probe request
On 14/02/18 14:53, Jean-Philippe Brucker wrote: When the device offers the probe feature, send a probe request for each device managed by the IOMMU. Extract RESV_MEM information. When we encounter a MSI doorbell region, set it up as a IOMMU_RESV_MSI region. This will tell other subsystems that there is no need to map the MSI doorbell in the virtio-iommu, because MSIs bypass it. Signed-off-by: Jean-Philippe Brucker--- drivers/iommu/virtio-iommu.c | 163 -- include/uapi/linux/virtio_iommu.h | 37 + 2 files changed, 193 insertions(+), 7 deletions(-) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index a9c9245e8ba2..3ac4b38eaf19 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -45,6 +45,7 @@ struct viommu_dev { struct iommu_domain_geometrygeometry; u64 pgsize_bitmap; u8 domain_bits; + u32 probe_size; }; struct viommu_mapping { @@ -72,6 +73,7 @@ struct viommu_domain { struct viommu_endpoint { struct viommu_dev *viommu; struct viommu_domain*vdomain; + struct list_headresv_regions; }; struct viommu_request { @@ -140,6 +142,10 @@ static int viommu_get_req_size(struct viommu_dev *viommu, case VIRTIO_IOMMU_T_UNMAP: size = sizeof(r->unmap); break; + case VIRTIO_IOMMU_T_PROBE: + *bottom += viommu->probe_size; + size = sizeof(r->probe) + *bottom; + break; default: return -EINVAL; } @@ -448,6 +454,105 @@ static int viommu_replay_mappings(struct viommu_domain *vdomain) return ret; } +static int viommu_add_resv_mem(struct viommu_endpoint *vdev, + struct virtio_iommu_probe_resv_mem *mem, + size_t len) +{ + struct iommu_resv_region *region = NULL; + unsigned long prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO; + + u64 addr = le64_to_cpu(mem->addr); + u64 size = le64_to_cpu(mem->size); + + if (len < sizeof(*mem)) + return -EINVAL; + + switch (mem->subtype) { + case VIRTIO_IOMMU_RESV_MEM_T_MSI: + region = iommu_alloc_resv_region(addr, size, prot, +IOMMU_RESV_MSI); + break; + case VIRTIO_IOMMU_RESV_MEM_T_RESERVED: + default: + 
region = iommu_alloc_resv_region(addr, size, 0, +IOMMU_RESV_RESERVED); + break; + } + + list_add(&vdev->resv_regions, &region->list); + + /* +* Treat unknown subtype as RESERVED, but urge users to update their +* driver. +*/ + if (mem->subtype != VIRTIO_IOMMU_RESV_MEM_T_RESERVED && + mem->subtype != VIRTIO_IOMMU_RESV_MEM_T_MSI) + pr_warn("unknown resv mem subtype 0x%x\n", mem->subtype); Might as well avoid the extra comparisons by incorporating this into the switch statement, i.e.: default: dev_warn(vdev->viommu_dev->dev, ...); /* Fallthrough */ case VIRTIO_IOMMU_RESV_MEM_T_RESERVED: ... (dev_warn is generally preferable to pr_warn when feasible) + + return 0; +} + +static int viommu_probe_endpoint(struct viommu_dev *viommu, struct device *dev) +{ + int ret; + u16 type, len; + size_t cur = 0; + struct virtio_iommu_req_probe *probe; + struct virtio_iommu_probe_property *prop; + struct iommu_fwspec *fwspec = dev->iommu_fwspec; + struct viommu_endpoint *vdev = fwspec->iommu_priv; + + if (!fwspec->num_ids) + /* Trouble ahead. */ + return -EINVAL; + + probe = kzalloc(sizeof(*probe) + viommu->probe_size + + sizeof(struct virtio_iommu_req_tail), GFP_KERNEL); + if (!probe) + return -ENOMEM; + + probe->head.type = VIRTIO_IOMMU_T_PROBE; + /* +* For now, assume that properties of an endpoint that outputs multiple +* IDs are consistent. Only probe the first one. +*/ + probe->endpoint = cpu_to_le32(fwspec->ids[0]); + + ret = viommu_send_req_sync(viommu, probe); + if (ret) + goto out_free; + + prop = (void *)probe->properties; + type = le16_to_cpu(prop->type) & VIRTIO_IOMMU_PROBE_T_MASK; + + while (type != VIRTIO_IOMMU_PROBE_T_NONE && + cur < viommu->probe_size) { + len = le16_to_cpu(prop->length); + + switch (type) { + case VIRTIO_IOMMU_PROBE_T_RESV_MEM: + ret = viommu_add_resv_mem(vdev, (void *)prop->value, len); + break; +
Re: [PATCH 1/4] iommu: Add virtio-iommu driver
On 14/02/18 14:53, Jean-Philippe Brucker wrote: The virtio IOMMU is a para-virtualized device, allowing to send IOMMU requests such as map/unmap over virtio-mmio transport without emulating page tables. This implementation handles ATTACH, DETACH, MAP and UNMAP requests. The bulk of the code transforms calls coming from the IOMMU API into corresponding virtio requests. Mappings are kept in an interval tree instead of page tables. Signed-off-by: Jean-Philippe Brucker--- MAINTAINERS | 6 + drivers/iommu/Kconfig | 11 + drivers/iommu/Makefile| 1 + drivers/iommu/virtio-iommu.c | 960 ++ include/uapi/linux/virtio_ids.h | 1 + include/uapi/linux/virtio_iommu.h | 116 + 6 files changed, 1095 insertions(+) create mode 100644 drivers/iommu/virtio-iommu.c create mode 100644 include/uapi/linux/virtio_iommu.h diff --git a/MAINTAINERS b/MAINTAINERS index 3bdc260e36b7..2a181924d420 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14818,6 +14818,12 @@ S: Maintained F:drivers/virtio/virtio_input.c F:include/uapi/linux/virtio_input.h +VIRTIO IOMMU DRIVER +M: Jean-Philippe Brucker +S: Maintained +F: drivers/iommu/virtio-iommu.c +F: include/uapi/linux/virtio_iommu.h + VIRTUAL BOX GUEST DEVICE DRIVER M:Hans de Goede M:Arnd Bergmann diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index f3a21343e636..1ea0ec74524f 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -381,4 +381,15 @@ config QCOM_IOMMU help Support for IOMMU on certain Qualcomm SoCs. +config VIRTIO_IOMMU + bool "Virtio IOMMU driver" + depends on VIRTIO_MMIO + select IOMMU_API + select INTERVAL_TREE + select ARM_DMA_USE_IOMMU if ARM + help + Para-virtualised IOMMU driver with virtio. + + Say Y here if you intend to run this kernel as a guest. 
+ endif # IOMMU_SUPPORT diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index 1fb695854809..9c68be1365e1 100644 --- a/drivers/iommu/Makefile +++ b/drivers/iommu/Makefile @@ -29,3 +29,4 @@ obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o obj-$(CONFIG_S390_IOMMU) += s390-iommu.o obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o +obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c new file mode 100644 index ..a9c9245e8ba2 --- /dev/null +++ b/drivers/iommu/virtio-iommu.c @@ -0,0 +1,960 @@ +/* + * Virtio driver for the paravirtualized IOMMU + * + * Copyright (C) 2018 ARM Limited + * Author: Jean-Philippe Brucker + * + * SPDX-License-Identifier: GPL-2.0 This wants to be a // comment at the very top of the file (thankfully the policy is now properly documented in-tree since Documentation/process/license-rules.rst got merged) + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#define MSI_IOVA_BASE 0x800 +#define MSI_IOVA_LENGTH0x10 + +struct viommu_dev { + struct iommu_device iommu; + struct device *dev; + struct virtio_device*vdev; + + struct ida domain_ids; + + struct virtqueue*vq; + /* Serialize anything touching the request queue */ + spinlock_t request_lock; + + /* Device configuration */ + struct iommu_domain_geometrygeometry; + u64 pgsize_bitmap; + u8 domain_bits; +}; + +struct viommu_mapping { + phys_addr_t paddr; + struct interval_tree_node iova; + union { + struct virtio_iommu_req_map map; + struct virtio_iommu_req_unmap unmap; + } req; +}; + +struct viommu_domain { + struct iommu_domain domain; + struct viommu_dev *viommu; + struct mutexmutex; + unsigned intid; + + spinlock_t mappings_lock; + struct rb_root_cached mappings; + + /* Number of endpoints attached to this domain */ + 
unsigned long endpoints; +}; + +struct viommu_endpoint { + struct viommu_dev *viommu; + struct viommu_domain*vdomain; +}; + +struct viommu_request { + struct scatterlist top; + struct scatterlist bottom; + + int
Re: [PATCH 1/4] iommu: Add virtio-iommu driver
On 21/03/18 13:14, Jean-Philippe Brucker wrote: On 21/03/18 06:43, Tian, Kevin wrote: [...] + +#include + +#define MSI_IOVA_BASE 0x800 +#define MSI_IOVA_LENGTH0x10 this is ARM specific, and according to virtio-iommu spec isn't it better probed on the endpoint instead of hard-coding here? These values are arbitrary, not really ARM-specific even if ARM is the only user yet: we're just reserving a random IOVA region for mapping MSIs. It is hard-coded because of the way iommu-dma.c works, but I don't quite remember why that allocation isn't dynamic. The host kernel needs to have *some* MSI region in place before the guest can start configuring interrupts, otherwise it won't know what address to give to the underlying hardware. However, as soon as the host kernel has picked a region, host userspace needs to know that it can no longer use addresses in that region for DMA-able guest memory. It's a lot easier when the address is fixed in hardware and the host userspace will never be stupid enough to try and VFIO_IOMMU_DMA_MAP it, but in the more general case where MSI writes undergo IOMMU address translation so it's an arbitrary IOVA, this has the potential to conflict with stuff like guest memory hotplug. What we currently have is just the simplest option, with the host kernel just picking something up-front and pretending to host userspace that it's a fixed hardware address. There's certainly scope for it to be a bit more dynamic in the sense of adding an interface to let userspace move it around (before attaching any devices, at least), but I don't think it's feasible for the host kernel to second-guess userspace enough to make it entirely transparent like it is in the DMA API domain case. Of course, that's all assuming the host itself is using a virtio-iommu (e.g. in a nested virt or emulation scenario). When it's purely within a guest then an MSI reservation shouldn't matter so much, since the guest won't be anywhere near the real hardware configuration anyway. Robin. 
As said on the v0.6 spec thread, I'm not sure allocating the IOVA range in the host is preferable. With nested translation the guest has to map it anyway, and I believe dealing with IOVA allocation should be left to the guest when possible. Thanks, Jean ___ iommu mailing list io...@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 4/4] vfio: Allow type-1 IOMMU instantiation with a virtio-iommu
On 14/02/18 15:26, Alex Williamson wrote: On Wed, 14 Feb 2018 14:53:40 + Jean-Philippe Bruckerwrote: When enabling both VFIO and VIRTIO_IOMMU modules, automatically select VFIO_IOMMU_TYPE1 as well. Signed-off-by: Jean-Philippe Brucker --- drivers/vfio/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index c84333eb5eb5..65a1e691110c 100644 --- a/drivers/vfio/Kconfig +++ b/drivers/vfio/Kconfig @@ -21,7 +21,7 @@ config VFIO_VIRQFD menuconfig VFIO tristate "VFIO Non-Privileged userspace driver framework" depends on IOMMU_API - select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM_SMMU || ARM_SMMU_V3) + select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM_SMMU || ARM_SMMU_V3 || VIRTIO_IOMMU) select ANON_INODES help VFIO provides a framework for secure userspace device drivers. Why are we basing this on specific IOMMU drivers in the first place? Only ARM is doing that. Shouldn't IOMMU_API only be enabled for ARM targets that support it and therefore we can forget about the specific IOMMU drivers? Thanks, Makes sense - the majority of ARM systems (and mobile/embedded ARM64 ones) making use of IOMMU_API won't actually support VFIO, but it can't hurt to allow them to select the type 1 driver regardless. Especially as multiplatform configs are liable to be pulling in the SMMU driver(s) anyway. Robin. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v2 1/2] virtio: Make ARM SMMU workaround more specific
Whilst always using the DMA API is OK on ARM systems in most cases, there can be a problem if a hypervisor fails to tell its guest that a virtio device is cache-coherent. In that case, the guest will end up making non-cacheable mappings for DMA buffers (i.e. the vring), which, if the host is using a cacheable view of the same buffer on the other end, is not a recipe for success. It turns out that current kvmtool, and probably QEMU as well, runs into this exact problem, and a guest using a virtio console can be seen to hang pretty quickly after writing a few characters as host data in cache and guest data directly in RAM go out of sync. In order to fix this, narrow the scope of the original workaround from all legacy devices to just those behind IOMMUs, which was really the only thing we were trying to deal with in the first place. Fixes: c7070619f340 ("vring: Force use of DMA API for ARM-based systems with legacy devices") Signed-off-by: Robin Murphy <robin.mur...@arm.com> --- drivers/virtio/virtio_ring.c | 30 -- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 7e38ed79c3fc..03e824c77d61 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -117,6 +118,27 @@ struct vring_virtqueue { #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq) /* + * ARM Fast Models are hopefully unique in implementing "hardware" legacy + * virtio block devices, which can be placed behind a "real" IOMMU, but are + * unaware of VIRTIO_F_IOMMU_PLATFORM. Fortunately, we can detect whether + * an IOMMU is present and in use by checking whether an IOMMU driver has + * assigned the DMA master device a group. 
+ */ +static bool vring_arm_legacy_dma_quirk(struct virtio_device *vdev) +{ + struct iommu_group *group; + + if (!(IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_ARM64)) || + virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) + return false; + + group = iommu_group_get(vdev->dev.parent); + iommu_group_put(group); + + return group != NULL; +} + +/* * Modern virtio devices have feature bits to specify whether they need a * quirk and bypass the IOMMU. If not there, just use the DMA API. * @@ -159,12 +181,8 @@ static bool vring_use_dma_api(struct virtio_device *vdev) if (xen_domain()) return true; - /* -* On ARM-based machines, the DMA ops will do the right thing, -* so always use them with legacy devices. -*/ - if (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_ARM64)) - return !virtio_has_feature(vdev, VIRTIO_F_VERSION_1); + if (vring_arm_legacy_dma_quirk(vdev)) + return true; return false; } -- 2.11.0.dirty ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v2 2/2] virtio: Document DMA coherency
Since making use of the DMA API will require the architecture code to have the correct notion of device cache-coherency on architectures like ARM, explicitly call this out in the virtio-mmio DT binding. The ship has sailed for legacy virtio, but let's hope that we can head off any future firmware mishaps. Signed-off-by: Robin Murphy <robin.mur...@arm.com> --- Documentation/devicetree/bindings/virtio/mmio.txt | 11 +++ 1 file changed, 11 insertions(+) diff --git a/Documentation/devicetree/bindings/virtio/mmio.txt b/Documentation/devicetree/bindings/virtio/mmio.txt index 5069c1b8e193..999a93faa67c 100644 --- a/Documentation/devicetree/bindings/virtio/mmio.txt +++ b/Documentation/devicetree/bindings/virtio/mmio.txt @@ -7,6 +7,16 @@ Required properties: - compatible: "virtio,mmio" compatibility string - reg: control registers base address and size including configuration space - interrupts: interrupt generated by the device +- dma-coherent:required if the device (or host emulation) accesses memory + cache-coherently, absent otherwise + +Linux implementation note: + +virtio devices not advertising the VIRTIO_F_IOMMU_PLATFORM flag have been +implicitly assumed to be cache-coherent by Linux, and for legacy reasons this +behaviour is likely to remain. If VIRTIO_F_IOMMU_PLATFORM is advertised, then +such assumptions cannot be relied upon and the "dma-coherent" property must +accurately reflect the coherency of the device. Example: @@ -14,4 +24,5 @@ Example: compatible = "virtio,mmio"; reg = <0x3000 0x100>; interrupts = <41>; + dma-coherent; } -- 2.11.0.dirty ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization