Re: [PATCH v3 02/12] iommu: Add iommu_split_block interface
On 4/20/21 3:32 PM, Keqian Zhu wrote: Hi Baolu, Cheers for your quick reply. On 2021/4/20 10:09, Lu Baolu wrote: Hi Keqian, On 4/20/21 9:25 AM, Keqian Zhu wrote: Hi Baolu, On 2021/4/19 21:33, Lu Baolu wrote: Hi Keqian, On 2021/4/19 17:32, Keqian Zhu wrote: +EXPORT_SYMBOL_GPL(iommu_split_block); Do you really have any consumers of this interface other than the dirty bit tracking? If not, I don't suggest making this a generic IOMMU interface. There is an implicit requirement for such interfaces: iommu_map/unmap(iova, size) shouldn't be called at the same time. Currently there's no such sanity check in the iommu core. A poorly written driver could mess up the kernel by misusing this interface. Yes, I can't think of a scenario other than dirty tracking. Indeed, we'd better not make them a generic interface. Do you have any suggestion for how the underlying iommu drivers can share this code without making it a generic iommu interface? I have a not-so-good idea: make the "split" interfaces static functions, and pass the function pointer to start_dirty_log. But that looks weird and inflexible. I understand splitting/merging super pages is an optimization, not a functional requirement. So is it possible to let the vendor iommu driver decide whether to split super pages when starting dirty bit tracking, and do the opposite operation when stopping it? The requirement for Right. If I understand you correctly, that is actually what this series does. I mean to say: no generic APIs, just do it within the iommu subsystem itself. It's totally transparent to the upper level, just like what map() does. The upper layer doesn't care whether a super page or a small page is in use when doing a mapping, right? If you want to consolidate some code, how about putting it in start/stop_tracking()? Yep, this reminds me. What we want to reuse is the logic of "chunk by chunk" in split().
We can implement switch_dirty_log to be "chunk by chunk" too (just the same as sync/clear), then the vendor iommu driver can invoke its own private implementation of split(). So we can completely remove split() from the IOMMU core layer. Example code logic:

iommu.c:

    switch_dirty_log(big range) {
        for_each_iommu_page(big range) {
            ops->switch_dirty_log(iommu_pgsize)
        }
    }

vendor iommu driver:

    switch_dirty_log(iommu_pgsize) {
        if (enable) {
            ops->split_block(iommu_pgsize)
            /* And other actions, such as enable hardware capability */
        } else {
            for_each_continuous_physical_address(iommu_pgsize)
                ops->merge_page()
        }
    }

Besides, the vendor iommu driver can invoke split() in clear_dirty_log instead of in switch_dirty_log. The benefit is that we usually clear the dirty log gradually during dirty tracking, so we can split large page mappings gradually, which speeds up start_dirty_log and has less side effect on DMA performance. Does it look good to you? Yes. It's clearer now. Thanks, Keqian Best regards, baolu
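The "chunk by chunk" dispatch discussed above can be modeled as a small user-space sketch. This is illustrative only, not real kernel API: the fixed 4K chunk size stands in for iommu_pgsize(), and the vendor callback simply counts how it would split or merge each chunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the proposed flow; all names are illustrative. */
#define CHUNK_SIZE 0x1000UL  /* pretend iommu_pgsize() always returns 4K */

static unsigned long split_calls, merge_calls;

/* vendor driver: per-chunk handler invoked by the core */
static void vendor_switch_dirty_log(unsigned long iova, size_t size, bool enable)
{
    (void)iova; (void)size;
    if (enable)
        split_calls++;      /* would call its private split_block() */
    else
        merge_calls++;      /* would call its private merge_page() */
}

/* iommu core: walk the big range chunk by chunk, no generic split() API */
static void core_switch_dirty_log(unsigned long iova, size_t size, bool enable)
{
    while (size) {
        size_t pgsize = CHUNK_SIZE;   /* stand-in for iommu_pgsize() */
        vendor_switch_dirty_log(iova, pgsize, enable);
        iova += pgsize;
        size -= pgsize;
    }
}
```

The point of the sketch is the layering: the core only iterates, while the split/merge decision lives entirely in the vendor callback, so no split() needs to be exported from the core.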
Re: [PATCH v3 02/12] iommu: Add iommu_split_block interface
Hi Keqian, On 4/20/21 9:25 AM, Keqian Zhu wrote: Hi Baolu, On 2021/4/19 21:33, Lu Baolu wrote: Hi Keqian, On 2021/4/19 17:32, Keqian Zhu wrote: +EXPORT_SYMBOL_GPL(iommu_split_block); Do you really have any consumers of this interface other than the dirty bit tracking? If not, I don't suggest making this a generic IOMMU interface. There is an implicit requirement for such interfaces: iommu_map/unmap(iova, size) shouldn't be called at the same time. Currently there's no such sanity check in the iommu core. A poorly written driver could mess up the kernel by misusing this interface. Yes, I can't think of a scenario other than dirty tracking. Indeed, we'd better not make them a generic interface. Do you have any suggestion for how the underlying iommu drivers can share this code without making it a generic iommu interface? I have a not-so-good idea: make the "split" interfaces static functions, and pass the function pointer to start_dirty_log. But that looks weird and inflexible. I understand splitting/merging super pages is an optimization, not a functional requirement. So is it possible to let the vendor iommu driver decide whether to split super pages when starting dirty bit tracking, and do the opposite operation when stopping it? The requirement for Right. If I understand you correctly, that is actually what this series does. I mean to say: no generic APIs, just do it within the iommu subsystem itself. It's totally transparent to the upper level, just like what map() does. The upper layer doesn't care whether a super page or a small page is in use when doing a mapping, right? If you want to consolidate some code, how about putting it in start/stop_tracking()? Best regards, baolu We implemented split/merge in the IOMMU core layer, but don't force the vendor driver to use it. The problem is that when we expose these interfaces to the vendor IOMMU driver, we also expose them to upper drivers.
upper layer is that starting/stopping dirty bit tracking and mapping/unmapping are mutually exclusive. OK, I will explicitly add the hints. Thanks. Thanks, Keqian On the other hand, if a driver calls map/unmap and split/merge at the same time, it's a bug in the driver; it should follow the rule. Best regards, baolu
Re: [PATCH v3 02/12] iommu: Add iommu_split_block interface
Hi Keqian, On 2021/4/19 17:32, Keqian Zhu wrote: +EXPORT_SYMBOL_GPL(iommu_split_block); Do you really have any consumers of this interface other than the dirty bit tracking? If not, I don't suggest making this a generic IOMMU interface. There is an implicit requirement for such interfaces: iommu_map/unmap(iova, size) shouldn't be called at the same time. Currently there's no such sanity check in the iommu core. A poorly written driver could mess up the kernel by misusing this interface. Yes, I can't think of a scenario other than dirty tracking. Indeed, we'd better not make them a generic interface. Do you have any suggestion for how the underlying iommu drivers can share this code without making it a generic iommu interface? I have a not-so-good idea: make the "split" interfaces static functions, and pass the function pointer to start_dirty_log. But that looks weird and inflexible. I understand splitting/merging super pages is an optimization, not a functional requirement. So is it possible to let the vendor iommu driver decide whether to split super pages when starting dirty bit tracking, and do the opposite operation when stopping it? The requirement for the upper layer is that starting/stopping dirty bit tracking and mapping/unmapping are mutually exclusive. On the other hand, if a driver calls map/unmap and split/merge at the same time, it's a bug in the driver; it should follow the rule. Best regards, baolu
Re: [PATCH] iommu: Use passthrough mode for the Intel IPUs
Hi Bingbu, On 4/19/21 12:57 PM, Bingbu Cao wrote: Intel IPU (Image Processing Unit) has its own (IO)MMU hardware. The IPU driver allocates its own page table that is not mapped via the DMA, and thus the Intel IOMMU driver blocks access, giving this error: DMAR: DRHD: handling fault status reg 3 DMAR: [DMA Read] Request device [00:05.0] PASID fault addr 76406000 [fault reason 06] PTE Read access is not set As the IPU is not an external-facing device, it is not risky, so use IOMMU passthrough mode for the Intel IPUs. As a quirk, does it need to be back-ported to stable kernels? If so, add a Fixes tag and cc stable, please. Signed-off-by: Bingbu Cao --- drivers/iommu/intel/iommu.c | 35 +++ drivers/staging/media/ipu3/ipu3.c | 2 +- include/linux/pci_ids.h | 5 + 3 files changed, 41 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index ee0932307d64..59222d2fe73f 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -55,6 +55,12 @@ #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY) #define IS_USB_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_SERIAL_USB) #define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA) +#define IS_IPU_DEVICE(pdev) ((pdev)->vendor == PCI_VENDOR_ID_INTEL && \ +((pdev)->device == PCI_DEVICE_ID_INTEL_IPU3 || \ +(pdev)->device == PCI_DEVICE_ID_INTEL_IPU6 || \ +(pdev)->device == PCI_DEVICE_ID_INTEL_IPU6SE || \ +(pdev)->device == PCI_DEVICE_ID_INTEL_IPU6SE_P || \ +(pdev)->device == PCI_DEVICE_ID_INTEL_IPU6EP)) #define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e) #define IOAPIC_RANGE_START (0xfee00000) @@ -360,6 +366,7 @@ int intel_iommu_enabled = 0; EXPORT_SYMBOL_GPL(intel_iommu_enabled); static int dmar_map_gfx = 1; +static int dmar_map_ipu = 1; static int dmar_forcedac; static int intel_iommu_strict; static int intel_iommu_superpage = 1; @@ -368,6 +375,7 @@ static int iommu_skip_te_disable; #define IDENTMAP_GFX 2 #define
IDENTMAP_AZALIA 4 +#define IDENTMAP_IPU 8 int intel_iommu_gfx_mapped; EXPORT_SYMBOL_GPL(intel_iommu_gfx_mapped); @@ -2839,6 +2847,9 @@ static int device_def_domain_type(struct device *dev) if ((iommu_identity_mapping & IDENTMAP_GFX) && IS_GFX_DEVICE(pdev)) return IOMMU_DOMAIN_IDENTITY; + + if ((iommu_identity_mapping & IDENTMAP_IPU) && IS_IPU_DEVICE(pdev)) + return IOMMU_DOMAIN_IDENTITY; } return 0; @@ -3278,6 +3289,9 @@ static int __init init_dmars(void) if (!dmar_map_gfx) iommu_identity_mapping |= IDENTMAP_GFX; + if (!dmar_map_ipu) + iommu_identity_mapping |= IDENTMAP_IPU; + check_tylersburg_isoch(); ret = si_domain_init(hw_pass_through); @@ -5622,6 +5636,15 @@ static void quirk_iommu_igfx(struct pci_dev *dev) dmar_map_gfx = 0; } +static void quirk_iommu_ipu(struct pci_dev *dev) +{ + if (risky_device(dev)) + return; + + pci_info(dev, "Passthrough IOMMU for integrated Intel IPU\n"); + dmar_map_ipu = 0; +} + /* G4x/GM45 integrated gfx dmar support is totally busted. */ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2a40, quirk_iommu_igfx); DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e00, quirk_iommu_igfx); @@ -5657,6 +5680,18 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1632, quirk_iommu_igfx); DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x163A, quirk_iommu_igfx); DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x163D, quirk_iommu_igfx); +/* disable IPU dmar support */ +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU3, +quirk_iommu_ipu); +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU6EP, +quirk_iommu_ipu); +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU6SE_P, +quirk_iommu_ipu); +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU6, +quirk_iommu_ipu); +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU6SE, +quirk_iommu_ipu); This duplicates the IS_IPU_DEVICE() device list above. Please keep a single device list.
+ static void quirk_iommu_rwbf(struct pci_dev *dev) { if (risky_device(dev)) diff --git a/drivers/staging/media/ipu3/ipu3.c b/drivers/staging/media/ipu3/ipu3.c index ee1bba6bdcac..aee1130ac042 100644 --- a/drivers/staging/media/ipu3/ipu3.c +++ b/drivers/staging/media/ipu3/ipu3.c @@ -16,7 +16,7 @@ #include "ipu3-dmamap.h" #include "ipu3-mmu.h" -#define IMGU_PCI_ID 0x1919
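The reviewer's "single device list" suggestion could look roughly like the following user-space sketch: one shared ID table consulted by a lookup helper, which both an IS_IPU_DEVICE()-style check and the fixup registration loop could walk. The device IDs here are placeholders, not the real Intel IPU PCI IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VENDOR_INTEL 0x8086

/* Placeholder device IDs -- illustrative only, not the real IPU IDs. */
static const uint16_t ipu_ids[] = { 0x1111, 0x2222, 0x3333, 0x4444, 0x5555 };

/* Single source of truth: both the identity-map check and the quirk
 * registration would iterate this one array. */
static bool is_ipu_device(uint16_t vendor, uint16_t device)
{
    if (vendor != VENDOR_INTEL)
        return false;
    for (size_t i = 0; i < sizeof(ipu_ids) / sizeof(ipu_ids[0]); i++)
        if (ipu_ids[i] == device)
            return true;
    return false;
}
```

With this shape, adding a new IPU model means touching one array instead of both a macro and a block of fixup declarations.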
Re: [PATCH v3 01/12] iommu: Introduce dirty log tracking framework
Hi Keqian, On 4/16/21 5:07 PM, Keqian Zhu wrote: I am worrying about having two sets of APIs for a single purpose. From the vendor iommu driver's point of view, this feature is per device. Hence, it still needs to do the same thing. Yes, we can unify the granule of feature reporting and status management. The basic granule of dirty tracking is the iommu_domain, which I think is very reasonable. We need an interface to report the feature of an iommu_domain; then the logic is much clearer. Every time we add a device to or remove a device from the domain, we should update the feature (e.g., maintain a counter of unsupported devices). Yes. This looks cleaner. What do you think about this idea? Thanks, Keqian Best regards, baolu
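The "counter of unsupported devices" idea above can be sketched in a few lines. This is a hypothetical user-space model, not kernel code: the structs and function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model: per-device capability, per-domain status. */
struct dev_model { bool supports_hwdbm; };

struct domain_model {
    unsigned int attached;   /* devices attached to the domain */
    unsigned int non_hwdbm;  /* of those, how many lack HWDBM */
};

static void domain_attach(struct domain_model *d, const struct dev_model *dev)
{
    d->attached++;
    if (!dev->supports_hwdbm)
        d->non_hwdbm++;      /* feature update on attach */
}

static void domain_detach(struct domain_model *d, const struct dev_model *dev)
{
    d->attached--;
    if (!dev->supports_hwdbm)
        d->non_hwdbm--;      /* feature update on detach */
}

/* Dirty log may start only when every attached device supports HWDBM. */
static bool domain_supports_dirty_log(const struct domain_model *d)
{
    return d->attached > 0 && d->non_hwdbm == 0;
}
```

The counter makes the per-domain query O(1) instead of re-walking all attached devices each time dirty tracking is started.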
Re: [PATCH v3 01/12] iommu: Introduce dirty log tracking framework
Hi, On 2021/4/15 15:43, Keqian Zhu wrote: design it as not switchable. I will modify the commit message of patch#12, thanks! I am not sure that I fully get your point. But I can't see any gaps in using iommu_dev_enable/disable_feature() to switch dirty log on and off. Probably I missed something. IOMMU_DEV_FEAT_HWDBM just tells the user whether the underlying IOMMU driver supports dirty tracking; it is not used to manage the status of dirty log tracking. The feature reporting is per device, but the status management is per iommu_domain. Only when all devices in a domain support HWDBM can we start dirty log for the domain. So why not for_each_device_attached_domain() iommu_dev_enable_feature(IOMMU_DEV_FEAT_HWDBM) ? And I think we'd better not mix feature reporting and status management. Thoughts? I am worrying about having two sets of APIs for a single purpose. From the vendor iommu driver's point of view, this feature is per device. Hence, it still needs to do the same thing. Best regards, baolu
Re: [PATCH v3 01/12] iommu: Introduce dirty log tracking framework
On 4/15/21 2:18 PM, Keqian Zhu wrote: Hi Baolu, Thanks for the review! On 2021/4/14 15:00, Lu Baolu wrote: Hi Keqian, On 4/13/21 4:54 PM, Keqian Zhu wrote: Some types of IOMMU are capable of tracking DMA dirty log, such as ARM SMMU with HTTU or Intel IOMMU with SLADE. This introduces the dirty log tracking framework in the IOMMU base layer. Three new essential interfaces are added, and we maintain the status of dirty log tracking in iommu_domain. 1. iommu_switch_dirty_log: Perform actions to start|stop dirty log tracking 2. iommu_sync_dirty_log: Sync dirty log from IOMMU into a dirty bitmap 3. iommu_clear_dirty_log: Clear dirty log of IOMMU by a mask bitmap A new dev feature is added to indicate whether a specific type of iommu hardware supports this and its driver implements it. Signed-off-by: Keqian Zhu Signed-off-by: Kunkun Jiang --- drivers/iommu/iommu.c | 150 ++ include/linux/iommu.h | 53 +++ 2 files changed, 203 insertions(+) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index d0b0a15dba84..667b2d6d2fc0 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1922,6 +1922,7 @@ static struct iommu_domain *__iommu_domain_alloc(struct bus_type *bus, domain->type = type; /* Assume all sizes by default; the driver may override this later */ domain->pgsize_bitmap = bus->iommu_ops->pgsize_bitmap; +mutex_init(&domain->switch_log_lock); return domain; } @@ -2720,6 +2721,155 @@ int iommu_domain_set_attr(struct iommu_domain *domain, } EXPORT_SYMBOL_GPL(iommu_domain_set_attr); +int iommu_switch_dirty_log(struct iommu_domain *domain, bool enable, + unsigned long iova, size_t size, int prot) +{ +const struct iommu_ops *ops = domain->ops; +int ret; + +if (unlikely(!ops || !ops->switch_dirty_log)) +return -ENODEV; + +mutex_lock(&domain->switch_log_lock); +if (enable && domain->dirty_log_tracking) { +ret = -EBUSY; +goto out; +} else if (!enable && !domain->dirty_log_tracking) { +ret = -EINVAL; +goto out; +} + +ret = ops->switch_dirty_log(domain, enable, iova, size, 
prot); +if (ret) +goto out; + +domain->dirty_log_tracking = enable; +out: +mutex_unlock(&domain->switch_log_lock); +return ret; +} +EXPORT_SYMBOL_GPL(iommu_switch_dirty_log); Since you also added IOMMU_DEV_FEAT_HWDBM, I am wondering what's the difference between iommu_switch_dirty_log(on) vs. iommu_dev_enable_feature(IOMMU_DEV_FEAT_HWDBM) iommu_switch_dirty_log(off) vs. iommu_dev_disable_feature(IOMMU_DEV_FEAT_HWDBM) Indeed. As I can see, IOMMU_DEV_FEAT_AUX is not switchable, so enable/disable are not applicable for it. IOMMU_DEV_FEAT_SVA is switchable, so we can use these interfaces for it. IOMMU_DEV_FEAT_HWDBM is used to indicate whether hardware supports HWDBM, so we should Previously we had iommu_dev_has_feature() and then it was cleaned up due to lack of real users. If you have a real case for it, why not bring it back? design it as not switchable. I will modify the commit message of patch#12, thanks! I am not sure that I fully get your point. But I can't see any gaps in using iommu_dev_enable/disable_feature() to switch dirty log on and off. Probably I missed something. + +int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova, + size_t size, unsigned long *bitmap, + unsigned long base_iova, unsigned long bitmap_pgshift) +{ +const struct iommu_ops *ops = domain->ops; +unsigned int min_pagesz; +size_t pgsize; +int ret = 0; + +if (unlikely(!ops || !ops->sync_dirty_log)) +return -ENODEV; + +min_pagesz = 1 << __ffs(domain->pgsize_bitmap); +if (!IS_ALIGNED(iova | size, min_pagesz)) { +pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n", + iova, size, min_pagesz); +return -EINVAL; +} + +mutex_lock(&domain->switch_log_lock); +if (!domain->dirty_log_tracking) { +ret = -EINVAL; +goto out; +} + +while (size) { +pgsize = iommu_pgsize(domain, iova, size); + +ret = ops->sync_dirty_log(domain, iova, pgsize, + bitmap, base_iova, bitmap_pgshift); Any reason why you want to do this in a per-4K-page manner? 
This can lead to a lot of indirect calls and bad performance. How about a sync_dirty_pages()? The function name of iommu_pgsize() is a bit puzzling. It actually tries to compute the max page size that fits into the remaining size, so pgsize can be a large page size even if the underlying mapping is 4K. The __iommu_unmap() also has similar logic. This series has some improvement on the iommu_pgsize() helper. https://lore.kernel.org/linux-iommu/20210405191112.28192-1-isa...@codeaurora.org/ Best regards, baolu
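The iommu_pgsize() behavior described above — pick the largest supported page size that is both aligned to the IOVA and no larger than the remaining size — can be sketched as follows. This is a simplified stand-in, not the kernel's actual helper; the function name is invented.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the "largest fitting page size" selection described in the
 * thread: pgsize_bitmap has one bit set per supported page size. */
static uint64_t pick_pgsize(uint64_t pgsize_bitmap, uint64_t iova, uint64_t size)
{
    uint64_t best = 0;

    for (int bit = 0; bit < 64; bit++) {
        uint64_t pgsize = 1ULL << bit;

        if (!(pgsize_bitmap & pgsize))
            continue;              /* not a supported page size */
        if (pgsize > size)
            break;                 /* must fit in the remaining size */
        if (iova & (pgsize - 1))
            continue;              /* iova must be pgsize-aligned */
        best = pgsize;             /* keep the largest candidate so far */
    }
    return best;
}
```

This is why the per-iteration chunk in the sync/split loops can be 2M or 1G even when the underlying mapping happens to be 4K — the helper only looks at the bitmap, alignment, and remaining size, not at the installed mapping.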
Re: [PATCH v2] iommu/vt-d: Force to flush iotlb before creating superpage
Hi Longpeng, On 4/15/21 8:46 AM, Longpeng(Mike) wrote: The translation caches may preserve obsolete data when the mapping size is changed. Suppose the following sequence, which can reveal the problem with high probability.

1. mmap(4GB,MAP_HUGETLB)
2. while (1) {
       (a) DMA MAP 0,0xa
       (b) DMA UNMAP 0,0xa
       (c) DMA MAP 0,0xc000
           * DMA read IOVA 0 may fail here (Not present)
           * if the problem occurs.
       (d) DMA UNMAP 0,0xc000
   }

The page table (only focusing on IOVA 0) after (a) is:
 PML4: 0x19db5c1003 entry:0x899bdcd2f000
 PDPE: 0x1a1cacb003 entry:0x89b35b5c1000
 PDE:  0x1a30a72003 entry:0x89b39cacb000
 PTE:  0x21d200803  entry:0x89b3b0a72000

The page table after (b) is:
 PML4: 0x19db5c1003 entry:0x899bdcd2f000
 PDPE: 0x1a1cacb003 entry:0x89b35b5c1000
 PDE:  0x1a30a72003 entry:0x89b39cacb000
 PTE:  0x0          entry:0x89b3b0a72000

The page table after (c) is:
 PML4: 0x19db5c1003 entry:0x899bdcd2f000
 PDPE: 0x1a1cacb003 entry:0x89b35b5c1000
 PDE:  0x21d200883  entry:0x89b39cacb000 (*)

Because the PDE entry after (b) is present, it won't be flushed even if the iommu driver flushes the cache when unmapping, so the obsolete data may be preserved in the cache, which would cause the wrong translation in the end. However, we can see the PDE entry is finally switched to a 2M-superpage mapping, but it does not transform to 0x21d200883 directly: 1. PDE: 0x1a30a72003 2. __domain_mapping dma_pte_free_pagetable Set the PDE entry to ZERO Set the PDE entry to 0x21d200883 So we must flush the cache after the entry switches to ZERO to avoid the obsolete info being preserved.
Cc: David Woodhouse Cc: Lu Baolu Cc: Nadav Amit Cc: Alex Williamson Cc: Joerg Roedel Cc: Kevin Tian Cc: Gonglei (Arei) Fixes: 6491d4d02893 ("intel-iommu: Free old page tables before creating superpage") Cc: # v3.0+ Link: https://lore.kernel.org/linux-iommu/670baaf8-4ff8-4e84-4be3-030b95ab5...@huawei.com/ Suggested-by: Lu Baolu Signed-off-by: Longpeng(Mike) --- v1 -> v2: - add Joerg - reconstruct the solution based on Baolu's suggestion --- drivers/iommu/intel/iommu.c | 52 + 1 file changed, 38 insertions(+), 14 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index ee09323..881c9f2 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -2289,6 +2289,41 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain, return level; } +/* + * Ensure that old small page tables are removed to make room for superpage(s). + * We're going to add new large pages, so make sure we don't remove their parent + * tables. The IOTLB/devTLBs should be flushed if any PDE/PTEs are cleared. 
+ */ +static void switch_to_super_page(struct dmar_domain *domain, +unsigned long start_pfn, +unsigned long end_pfn, int level) +{ + unsigned long lvl_pages = lvl_to_nr_pages(level); + struct dma_pte *pte = NULL; + int i; + + while (start_pfn <= end_pfn) { + if (!pte) + pte = pfn_to_dma_pte(domain, start_pfn, &level); + + if (dma_pte_present(pte)) { + dma_pte_free_pagetable(domain, start_pfn, + start_pfn + lvl_pages - 1, + level + 1); + + for_each_domain_iommu(i, domain) + iommu_flush_iotlb_psi(g_iommus[i], domain, + start_pfn, lvl_pages, + 0, 0); + } + + pte++; + start_pfn += lvl_pages; + if (first_pte_in_page(pte)) + pte = NULL; + } +} + static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn, unsigned long phys_pfn, unsigned long nr_pages, int prot) @@ -2329,22 +2364,11 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain, return -ENOMEM; /* It is large page*/ if (largepage_lvl > 1) { - unsigned long nr_superpages, end_pfn; + unsigned long end_pfn; pteval |= DMA_PTE_LARGE_PAGE; - lvl_pages = lvl_to_nr_pages(largepage_lvl); - - nr_superpages = nr_pages / lvl_pages; - end_pfn = iov_pfn + nr_superpages * lvl_pages - 1; - - /* -* Ensure that old small page tables are -* removed to make r
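The stale-translation problem this thread fixes can be illustrated with a toy model. Everything here is invented for illustration: a one-entry "IOTLB" caches the last PDE the walker saw and is only refreshed by an explicit flush, which is exactly why clearing the PTE under a still-present PDE, then repointing the PDE at a superpage, goes wrong without an intermediate flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy two-level model: a PDE either points at a page table (small
 * mappings) or is itself a 2M leaf. Covers IOVA 0 only. */
struct pt    { uint64_t pte; };          /* single 4K entry */
struct state {
    uint64_t  pde;                       /* 0 = not present */
    bool      pde_is_leaf;
    struct pt pt;
    /* "IOTLB": last PDE the walker saw; refreshed only on flush */
    uint64_t  cached_pde;
    bool      cached_is_leaf, cache_valid;
};

static void flush(struct state *s) { s->cache_valid = false; }

static bool translate(struct state *s)   /* can IOVA 0 be read? */
{
    if (!s->cache_valid) {
        s->cached_pde     = s->pde;
        s->cached_is_leaf = s->pde_is_leaf;
        s->cache_valid    = true;
    }
    if (!s->cached_pde)
        return false;
    if (s->cached_is_leaf)
        return true;                     /* superpage hit */
    return s->pt.pte != 0;               /* walk the (possibly freed) PT */
}
```

In the model, after a 4K map and translate, the non-leaf PDE is cached; a 4K unmap clears only the PTE (the PDE stays present, so nothing invalidates the cached copy), and a subsequent superpage remap is then invisible to the walker until a flush — mirroring the "flush after the entry switches to ZERO" rule in the patch.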
Re: [PATCH 2/2] iommu/sva: Remove mm parameter from SVA bind API
Hi Jason, On 4/14/21 7:26 PM, Jason Gunthorpe wrote: On Wed, Apr 14, 2021 at 02:22:09PM +0800, Lu Baolu wrote: I still worry about supervisor pasid allocation. If we use iommu_sva_alloc_pasid() to allocate a supervisor pasid, to which mm should the pasid be set? I have thought about passing &init_mm to iommu_sva_alloc_pasid(). But if you add "mm != current->mm", this seems not to work. Or do you prefer a separate interface for supervisor pasid allocation/free? Without a mm_struct it is not SVA, so don't use SVA APIs for whatever a 'supervisor pasid' is The supervisor PASID has its mm_struct. The only difference is that the device will set priv=1 in its DMA transactions with the PASID. Best regards, baolu
Re: [PATCH v3 02/12] iommu: Add iommu_split_block interface
On 4/13/21 4:54 PM, Keqian Zhu wrote: Block (largepage) mapping is not a proper granule for dirty log tracking. Take an extreme example: if DMA writes one byte, under a 1G mapping the dirty amount reported is 1G, but under a 4K mapping the dirty amount is just 4K. This adds a new interface named iommu_split_block in the IOMMU base layer. A specific IOMMU driver can invoke it when starting dirty log. If so, the driver also needs to implement the split_block iommu ops. We flush all iotlbs after the whole procedure is completed to ease the pressure on the IOMMU, as we will generally handle a huge range of mappings. Signed-off-by: Keqian Zhu Signed-off-by: Kunkun Jiang --- drivers/iommu/iommu.c | 41 + include/linux/iommu.h | 11 +++ 2 files changed, 52 insertions(+) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 667b2d6d2fc0..bb413a927870 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2721,6 +2721,47 @@ int iommu_domain_set_attr(struct iommu_domain *domain, } EXPORT_SYMBOL_GPL(iommu_domain_set_attr); +int iommu_split_block(struct iommu_domain *domain, unsigned long iova, + size_t size) +{ + const struct iommu_ops *ops = domain->ops; + unsigned int min_pagesz; + size_t pgsize; + bool flush = false; + int ret = 0; + + if (unlikely(!ops || !ops->split_block)) + return -ENODEV; + + min_pagesz = 1 << __ffs(domain->pgsize_bitmap); + if (!IS_ALIGNED(iova | size, min_pagesz)) { + pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n", + iova, size, min_pagesz); + return -EINVAL; + } + + while (size) { + flush = true; + + pgsize = iommu_pgsize(domain, iova, size); + + ret = ops->split_block(domain, iova, pgsize); + if (ret) + break; + + pr_debug("split handled: iova 0x%lx size 0x%zx\n", iova, pgsize); + + iova += pgsize; + size -= pgsize; + } + + if (flush) + iommu_flush_iotlb_all(domain); + + return ret; +} +EXPORT_SYMBOL_GPL(iommu_split_block); Do you really have any consumers of this interface other than the dirty bit tracking? 
If not, I don't suggest making this a generic IOMMU interface. There is an implicit requirement for such interfaces: iommu_map/unmap(iova, size) shouldn't be called at the same time. Currently there's no such sanity check in the iommu core. A poorly written driver could mess up the kernel by misusing this interface. This also applies to iommu_merge_page(). Best regards, baolu
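The batched-invalidation pattern in iommu_split_block() above — split chunk by chunk, but issue a single IOTLB flush for the whole range at the end — can be sketched in isolation. This is a hedged user-space model with fake callbacks, not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static unsigned int split_ops, flushes;

/* stand-in for ops->split_block() */
static int fake_split_block(unsigned long iova, size_t pgsize)
{
    (void)iova; (void)pgsize;
    split_ops++;
    return 0;
}

/* Mirrors the loop structure: per-chunk split, one deferred flush. */
static int split_range(unsigned long iova, size_t size, size_t pgsize)
{
    bool need_flush = false;
    int ret = 0;

    while (size) {
        need_flush = true;
        ret = fake_split_block(iova, pgsize);
        if (ret)
            break;
        iova += pgsize;
        size -= pgsize;
    }
    if (need_flush)
        flushes++;   /* one iommu_flush_iotlb_all() for the whole range */
    return ret;
}
```

Deferring the flush is the "ease the pressure on the IOMMU" point from the commit message: a 1G range split into 4K chunks costs 262144 split calls but still only one global invalidation.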
Re: [PATCH v3 01/12] iommu: Introduce dirty log tracking framework
Hi Keqian, On 4/13/21 4:54 PM, Keqian Zhu wrote: Some types of IOMMU are capable of tracking DMA dirty log, such as ARM SMMU with HTTU or Intel IOMMU with SLADE. This introduces the dirty log tracking framework in the IOMMU base layer. Three new essential interfaces are added, and we maintain the status of dirty log tracking in iommu_domain. 1. iommu_switch_dirty_log: Perform actions to start|stop dirty log tracking 2. iommu_sync_dirty_log: Sync dirty log from IOMMU into a dirty bitmap 3. iommu_clear_dirty_log: Clear dirty log of IOMMU by a mask bitmap A new dev feature is added to indicate whether a specific type of iommu hardware supports this and its driver implements it. Signed-off-by: Keqian Zhu Signed-off-by: Kunkun Jiang --- drivers/iommu/iommu.c | 150 ++ include/linux/iommu.h | 53 +++ 2 files changed, 203 insertions(+) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index d0b0a15dba84..667b2d6d2fc0 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1922,6 +1922,7 @@ static struct iommu_domain *__iommu_domain_alloc(struct bus_type *bus, domain->type = type; /* Assume all sizes by default; the driver may override this later */ domain->pgsize_bitmap = bus->iommu_ops->pgsize_bitmap; + mutex_init(&domain->switch_log_lock); return domain; } @@ -2720,6 +2721,155 @@ int iommu_domain_set_attr(struct iommu_domain *domain, } EXPORT_SYMBOL_GPL(iommu_domain_set_attr); +int iommu_switch_dirty_log(struct iommu_domain *domain, bool enable, + unsigned long iova, size_t size, int prot) +{ + const struct iommu_ops *ops = domain->ops; + int ret; + + if (unlikely(!ops || !ops->switch_dirty_log)) + return -ENODEV; + + mutex_lock(&domain->switch_log_lock); + if (enable && domain->dirty_log_tracking) { + ret = -EBUSY; + goto out; + } else if (!enable && !domain->dirty_log_tracking) { + ret = -EINVAL; + goto out; + } + + ret = ops->switch_dirty_log(domain, enable, iova, size, prot); + if (ret) + goto out; + + domain->dirty_log_tracking = enable; +out: + 
mutex_unlock(&domain->switch_log_lock); + return ret; +} +EXPORT_SYMBOL_GPL(iommu_switch_dirty_log); Since you also added IOMMU_DEV_FEAT_HWDBM, I am wondering what's the difference between iommu_switch_dirty_log(on) vs. iommu_dev_enable_feature(IOMMU_DEV_FEAT_HWDBM) iommu_switch_dirty_log(off) vs. iommu_dev_disable_feature(IOMMU_DEV_FEAT_HWDBM) + +int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova, +size_t size, unsigned long *bitmap, +unsigned long base_iova, unsigned long bitmap_pgshift) +{ + const struct iommu_ops *ops = domain->ops; + unsigned int min_pagesz; + size_t pgsize; + int ret = 0; + + if (unlikely(!ops || !ops->sync_dirty_log)) + return -ENODEV; + + min_pagesz = 1 << __ffs(domain->pgsize_bitmap); + if (!IS_ALIGNED(iova | size, min_pagesz)) { + pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n", + iova, size, min_pagesz); + return -EINVAL; + } + + mutex_lock(&domain->switch_log_lock); + if (!domain->dirty_log_tracking) { + ret = -EINVAL; + goto out; + } + + while (size) { + pgsize = iommu_pgsize(domain, iova, size); + + ret = ops->sync_dirty_log(domain, iova, pgsize, + bitmap, base_iova, bitmap_pgshift); Any reason why you want to do this in a per-4K-page manner? This can lead to a lot of indirect calls and bad performance. How about a sync_dirty_pages()? The same comment applies to other places in this patch series. 
+ if (ret) + break; + + pr_debug("dirty_log_sync handle: iova 0x%lx pagesz 0x%zx\n", +iova, pgsize); + + iova += pgsize; + size -= pgsize; + } +out: + mutex_unlock(&domain->switch_log_lock); + return ret; +} +EXPORT_SYMBOL_GPL(iommu_sync_dirty_log); + +static int __iommu_clear_dirty_log(struct iommu_domain *domain, + unsigned long iova, size_t size, + unsigned long *bitmap, + unsigned long base_iova, + unsigned long bitmap_pgshift) +{ + const struct iommu_ops *ops = domain->ops; + size_t pgsize; + int ret = 0; + + if (unlikely(!ops || !ops->clear_dirty_log)) + return -ENODEV; + + while (size) { + pgsize = iommu_pgsize(domain, iova, size); + + ret = ops->clear_dirty_log(domain, iova, pgsize, bitmap, +
Re: [PATCH 2/2] iommu/sva: Remove mm parameter from SVA bind API
Hi Jacob, On 4/14/21 8:09 AM, Jacob Pan wrote: Hi Jean, On Fri, 9 Apr 2021 11:03:05 -0700, Jacob Pan wrote: problems: * We don't have a use-case for binding the mm of a remote process (and it's supposedly difficult for device drivers to do it securely). So OK, we remove the mm argument from iommu_sva_bind_device() and use the current mm. But the IOMMU driver isn't going to do get_task_mm(current) every time it needs the mm being bound; it will take it from iommu_sva_bind_device(). Likewise iommu_sva_alloc_pasid() shouldn't need to bother with get_task_mm(). * cgroup accounting for IOASIDs needs to be on the current task. Removing the mm parameter from iommu_sva_alloc_pasid() doesn't help with that. Sure, it indicates that iommu_sva_alloc_pasid() needs a specific task context, but that's only for cgroup purposes, and I'd rather pass the cgroup down from iommu_sva_bind_device() anyway (but am fine with keeping it within ioasid_alloc() for now). Plus it's an internal helper; it's easy for us to check that the callers are doing the right thing. With the above split, we really just have one allocation function: ioasid_alloc(), so it can manage current cgroup accounting within. Would this work? After a few attempts, I don't think the split can work better. I will restore the mm parameter and add a warning if mm != current->mm. I still worry about supervisor pasid allocation. If we use iommu_sva_alloc_pasid() to allocate a supervisor pasid, to which mm should the pasid be set? I have thought about passing &init_mm to iommu_sva_alloc_pasid(). But if you add "mm != current->mm", this seems not to work. Or do you prefer a separate interface for supervisor pasid allocation/free? Best regards, baolu Thanks, Jacob
Re: [PATCH][next] iommu/vt-d: Fix out-bounds-warning in intel_svm_page_response()
Hi Gustavo, On 4/14/21 3:54 AM, Gustavo A. R. Silva wrote: Replace the call to memcpy() with just a couple of simple assignments in order to fix the following out-of-bounds warning: drivers/iommu/intel/svm.c:1198:4: warning: 'memcpy' offset [25, 32] from the object at 'desc' is out of the bounds of referenced subobject 'qw2' with type 'long long unsigned int' at offset 16 [-Warray-bounds] The problem is that the original code is trying to copy data into a couple of struct members adjacent to each other in a single call to memcpy(). This causes a legitimate compiler warning because memcpy() overruns the length of &desc.qw2. This helps with the ongoing efforts to globally enable -Warray-bounds and get us closer to being able to tighten the FORTIFY_SOURCE routines on memcpy(). Link: https://github.com/KSPP/linux/issues/109 Signed-off-by: Gustavo A. R. Silva --- drivers/iommu/intel/svm.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index 5165cea90421..65909f504c50 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -1194,9 +1194,10 @@ int intel_svm_page_response(struct device *dev, desc.qw1 = QI_PGRP_IDX(prm->grpid) | QI_PGRP_LPIG(last_page); desc.qw2 = 0; desc.qw3 = 0; - if (private_present) - memcpy(&desc.qw2, prm->private_data, - sizeof(prm->private_data)); The same memcpy() is used in multiple places in this file. Do they all trigger the same warnings? Or are there multiple patches to fix them one by one? Best regards, baolu + if (private_present) { + desc.qw2 = prm->private_data[0]; + desc.qw3 = prm->private_data[1]; + } qi_submit_sync(iommu, &desc, 1, 0); }
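The transformation in the patch — replacing a two-member memcpy() with per-member assignments — can be demonstrated with a small model. The struct below only mirrors the qw0..qw3 layout for illustration; it is not the real qi_desc definition.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the descriptor layout in question (names mirror the patch). */
struct desc_model { uint64_t qw0, qw1, qw2, qw3; };

static void fill(struct desc_model *desc, const uint64_t private_data[2])
{
    /* Per-member assignments: same bytes end up in qw2/qw3 as the old
     * memcpy(&desc->qw2, private_data, 16), but each store stays inside
     * the bounds of one member, so -Warray-bounds stays quiet. */
    desc->qw2 = private_data[0];
    desc->qw3 = private_data[1];
}
```

The memcpy() version was functionally fine on this layout (qw2 and qw3 are adjacent), but it writes past the declared size of the `qw2` subobject, which is exactly what the array-bounds checker flags.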
Re: [PATCH 5.4 v2 1/1] iommu/vt-d: Fix agaw for a supported 48 bit guest address width
I guess you need to ask Greg KH about this, Cc-ing sta...@vger.kernel.org. Best regards, baolu On 2021/4/12 3:36, Saeed Mirzamohammadi wrote: Hi Lu, Thanks for the review. May I know when we expect this to be applied to 5.4? Thanks, Saeed On Apr 7, 2021, at 5:25 PM, Lu Baolu <mailto:baolu...@linux.intel.com>> wrote: On 4/8/21 2:40 AM, Saeed Mirzamohammadi wrote: The IOMMU driver calculates the guest addressability for a DMA request based on the value of the mgaw reported from the IOMMU. However, this is a fused value and as mentioned in the spec, the guest width should be calculated based on the minimum of supported adjusted guest address width (SAGAW) and MGAW. This is from specification: "Guest addressability for a given DMA request is limited to the minimum of the value reported through this field and the adjusted guest address width of the corresponding page-table structure. (Adjusted guest address widths supported by hardware are reported through the SAGAW field)." This causes domain initialization to fail and following errors appear for EHCI PCI driver: [ 2.486393] ehci-pci 0000:01:00.4: EHCI Host Controller [ 2.486624] ehci-pci 0000:01:00.4: new USB bus registered, assigned bus number 1 [ 2.489127] ehci-pci 0000:01:00.4: DMAR: Allocating domain failed [ 2.489350] ehci-pci 0000:01:00.4: DMAR: 32bit DMA uses non-identity mapping [ 2.489359] ehci-pci 0000:01:00.4: can't setup: -12 [ 2.489531] ehci-pci 0000:01:00.4: USB bus 1 deregistered [ 2.490023] ehci-pci 0000:01:00.4: init 0000:01:00.4 fail, -12 [ 2.490358] ehci-pci: probe of 0000:01:00.4 failed with error -12 This issue happens when the value of the sagaw corresponds to a 48-bit agaw. This fix updates the calculation of the agaw based on the minimum of IOMMU's sagaw value and MGAW. Signed-off-by: Saeed Mirzamohammadi <mailto:saeed.mirzamohamm...@oracle.com>> Tested-by: Camille Lu mailto:camille...@hpe.com>> --- Change in v2: - Added cap_width to calculate AGAW based on the minimum value of MGAW and AGAW. 
--- drivers/iommu/intel-iommu.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 953d86ca6d2b..a2a03df97704 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -1853,7 +1853,7 @@ static inline int guestwidth_to_adjustwidth(int gaw) static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu, int guest_width) { -int adjust_width, agaw; +int adjust_width, agaw, cap_width; unsigned long sagaw; int err; @@ -1867,8 +1867,9 @@ static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu, domain_reserve_special_ranges(domain); /* calculate AGAW */ -if (guest_width > cap_mgaw(iommu->cap)) -guest_width = cap_mgaw(iommu->cap); +cap_width = min_t(int, cap_mgaw(iommu->cap), agaw_to_width(iommu->agaw)); +if (guest_width > cap_width) +guest_width = cap_width; domain->gaw = guest_width; adjust_width = guestwidth_to_adjustwidth(guest_width); agaw = width_to_agaw(adjust_width); Reviewed-by: Lu Baolu <mailto:baolu...@linux.intel.com>> Best regards, baolu
Re: [PATCH] iommu/vt-d: Fix an error handling path in 'intel_prepare_irq_remapping()'
On 4/11/21 3:08 PM, Christophe JAILLET wrote: If 'intel_cap_audit()' fails, we should return directly, as already done in the surrounding error handling path. Fixes: ad3d19029979 ("iommu/vt-d: Audit IOMMU Capabilities and add helper functions") Signed-off-by: Christophe JAILLET --- This patch is completely speculative. It is based on a spurious mix-up of direct return and goto. --- drivers/iommu/intel/irq_remapping.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c index 75429a5373ec..f912fe45bea2 100644 --- a/drivers/iommu/intel/irq_remapping.c +++ b/drivers/iommu/intel/irq_remapping.c @@ -736,7 +736,7 @@ static int __init intel_prepare_irq_remapping(void) return -ENODEV; if (intel_cap_audit(CAP_AUDIT_STATIC_IRQR, NULL)) - goto error; + return -ENODEV; if (!dmar_ir_support()) return -ENODEV; Thanks! Acked-by: Lu Baolu Best regards, baolu
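The point of the patch above is that a failing check taken before any resource is acquired has nothing to unwind, so a direct return is both safe and clearer than jumping to a cleanup label. A minimal userspace sketch of that control-flow shape (names hypothetical, -19 standing in for -ENODEV):

```c
#include <assert.h>

/* Toy model of intel_prepare_irq_remapping()'s early exits. Before any
 * resource has been set up there is nothing to unwind, so a failing
 * check can return directly instead of jumping to a cleanup label --
 * which is exactly the change in the patch above. -19 stands in for
 * -ENODEV. */
static int prepare_irq_remapping(int cap_audit_ok, int ir_supported)
{
	if (!cap_audit_ok)
		return -19;	/* direct return: no cleanup pending yet */
	if (!ir_supported)
		return -19;
	return 0;		/* all prerequisite checks passed */
}
```

Mixing `goto error` and direct returns for checks at the same stage, as the original code did, invites exactly the kind of speculative confusion the patch author mentions.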
[PATCH 1/1] iommu/vt-d: Fix build error of pasid_enable_wpe() with !X86
Commit f68c7f539b6e9 ("iommu/vt-d: Enable write protect for supervisor SVM") added pasid_enable_wpe() which hit below compile error with !X86. ../drivers/iommu/intel/pasid.c: In function 'pasid_enable_wpe': ../drivers/iommu/intel/pasid.c:554:22: error: implicit declaration of function 'read_cr0' [-Werror=implicit-function-declaration] 554 | unsigned long cr0 = read_cr0(); | ^~~~ In file included from ../include/linux/build_bug.h:5, from ../include/linux/bits.h:22, from ../include/linux/bitops.h:6, from ../drivers/iommu/intel/pasid.c:12: ../drivers/iommu/intel/pasid.c:557:23: error: 'X86_CR0_WP' undeclared (first use in this function) 557 | if (unlikely(!(cr0 & X86_CR0_WP))) { | ^~ ../include/linux/compiler.h:78:42: note: in definition of macro 'unlikely' 78 | # define unlikely(x) __builtin_expect(!!(x), 0) | ^ ../drivers/iommu/intel/pasid.c:557:23: note: each undeclared identifier is reported only once for each function it appears in 557 | if (unlikely(!(cr0 & X86_CR0_WP))) { | ^~ ../include/linux/compiler.h:78:42: note: in definition of macro 'unlikely' 78 | # define unlikely(x) __builtin_expect(!!(x), 0) | Add the missing dependency. 
Cc: Sanjay Kumar Cc: Jacob Pan Cc: Randy Dunlap Reported-by: kernel test robot Reported-by: Randy Dunlap Fixes: f68c7f539b6e9 ("iommu/vt-d: Enable write protect for supervisor SVM") Signed-off-by: Lu Baolu --- drivers/iommu/intel/pasid.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 477b2e1d303c..72646bafc52f 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -551,6 +551,7 @@ static void pasid_flush_caches(struct intel_iommu *iommu, static inline int pasid_enable_wpe(struct pasid_entry *pte) { +#ifdef CONFIG_X86 unsigned long cr0 = read_cr0(); /* CR0.WP is normally set but just to be sure */ @@ -558,6 +559,7 @@ static inline int pasid_enable_wpe(struct pasid_entry *pte) pr_err_ratelimited("No CPU write protect!\n"); return -EINVAL; } +#endif pasid_set_wpe(pte); return 0; -- 2.25.1
Re: [PATCH 2/2] iommu/sva: Remove mm parameter from SVA bind API
Hi, On 2021/4/9 1:08, Jacob Pan wrote: /** * iommu_sva_alloc_pasid - Allocate a PASID for the mm - * @mm: the mm * @min: minimum PASID value (inclusive) * @max: maximum PASID value (inclusive) * - * Try to allocate a PASID for this mm, or take a reference to the existing one - * provided it fits within the [@min, @max] range. On success the PASID is - * available in mm->pasid, and must be released with iommu_sva_free_pasid(). + * Try to allocate a PASID for the current mm, or take a reference to the + * existing one provided it fits within the [@min, @max] range. On success + * the PASID is available in the current mm->pasid, and must be released with + * iommu_sva_free_pasid(). * @min must be greater than 0, because 0 indicates an unused mm->pasid. * * Returns 0 on success and < 0 on error. */ -int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max) +int iommu_sva_alloc_pasid(ioasid_t min, ioasid_t max) { int ret = 0; ioasid_t pasid; + struct mm_struct *mm; if (min == INVALID_IOASID || max == INVALID_IOASID || min == 0 || max < min) return -EINVAL; mutex_lock(&iommu_sva_lock); + mm = get_task_mm(current); How could we allocate a supervisor PASID through iommu_sva_alloc_pasid() if we always use current->mm here? + if (!mm) { + ret = -EINVAL; + goto out_unlock; + } if (mm->pasid) { if (mm->pasid >= min && mm->pasid <= max) ioasid_get(mm->pasid); @@ -45,22 +51,32 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max) else mm->pasid = pasid; } + mmput(mm); +out_unlock: mutex_unlock(&iommu_sva_lock); return ret; } EXPORT_SYMBOL_GPL(iommu_sva_alloc_pasid); Best regards, baolu
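The alloc-or-get logic being reviewed above has a simple shape that can be modeled outside the kernel. This is a userspace sketch only (all names hypothetical; `toy_mm.refcount` models ioasid_get/ioasid_alloc reference counting): if the mm already carries a PASID inside [min, max], take another reference; otherwise hand out a fresh one. 0 means "no PASID yet", which is why a minimum of 0 is rejected.

```c
#include <assert.h>

/* Toy model (names hypothetical) of the alloc-or-get PASID logic quoted
 * above. A PASID of 0 marks "not allocated", so min must be >= 1. */
struct toy_mm {
	unsigned int pasid;	/* 0 == unused, like mm->pasid */
	unsigned int refcount;
};

static unsigned int next_pasid = 1;	/* models the IOASID allocator */

static int toy_alloc_pasid(struct toy_mm *mm, unsigned int min,
			   unsigned int max)
{
	if (min == 0 || max < min)
		return -1;		/* invalid range */
	if (mm->pasid) {
		if (mm->pasid < min || mm->pasid > max)
			return -1;	/* existing PASID out of range */
		mm->refcount++;		/* models ioasid_get() */
		return 0;
	}
	if (next_pasid < min)
		next_pasid = min;
	if (next_pasid > max)
		return -1;		/* range exhausted */
	mm->pasid = next_pasid++;	/* models ioasid_alloc() */
	mm->refcount = 1;
	return 0;
}
```

The supervisor-PASID concern raised in the review maps directly onto this model: if the "mm" is always taken from the current task, there is no `toy_mm` to hang a supervisor PASID on.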
Re: [PATCH 1/2] iommu/sva: Tighten SVA bind API with explicit flags
Hi Jacob, On 2021/4/9 1:08, Jacob Pan wrote: The void* drvdata parameter isn't really used in iommu_sva_bind_device() API, the current IDXD code "borrows" the drvdata for a VT-d private flag for supervisor SVA usage. Supervisor/Privileged mode request is a generic feature. It should be promoted from the VT-d vendor driver to the generic code. This patch replaces void* drvdata with an unsigned int flags parameter and adjusts callers accordingly. Link: https://lore.kernel.org/linux-iommu/YFhiMLR35WWMW%2FHu@myrica/ Suggested-by: Jean-Philippe Brucker Signed-off-by: Jacob Pan --- drivers/dma/idxd/cdev.c | 2 +- drivers/dma/idxd/init.c | 6 +++--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 2 +- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 4 ++-- drivers/iommu/intel/Kconfig | 1 + drivers/iommu/intel/svm.c | 18 ++ drivers/iommu/iommu.c | 9 ++--- drivers/misc/uacce/uacce.c | 2 +- include/linux/intel-iommu.h | 2 +- include/linux/intel-svm.h | 17 ++--- include/linux/iommu.h | 19 --- 11 files changed, 40 insertions(+), 42 deletions(-) diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c index 0db9b82..21ec82b 100644 --- a/drivers/dma/idxd/cdev.c +++ b/drivers/dma/idxd/cdev.c @@ -103,7 +103,7 @@ static int idxd_cdev_open(struct inode *inode, struct file *filp) filp->private_data = ctx; if (device_pasid_enabled(idxd)) { - sva = iommu_sva_bind_device(dev, current->mm, NULL); + sva = iommu_sva_bind_device(dev, current->mm, 0); if (IS_ERR(sva)) { rc = PTR_ERR(sva); dev_err(dev, "pasid allocation failed: %d\n", rc); diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c index 085a0c3..cdc85f1 100644 --- a/drivers/dma/idxd/init.c +++ b/drivers/dma/idxd/init.c @@ -300,13 +300,13 @@ static struct idxd_device *idxd_alloc(struct pci_dev *pdev) static int idxd_enable_system_pasid(struct idxd_device *idxd) { - int flags; + unsigned int flags; unsigned int pasid; struct iommu_sva *sva; - flags = SVM_FLAG_SUPERVISOR_MODE; + flags = IOMMU_SVA_BIND_SUPERVISOR; With 
SVM_FLAG_SUPERVISOR_MODE removed, I guess #include <linux/intel-svm.h> in this file could be removed as well. - sva = iommu_sva_bind_device(&idxd->pdev->dev, NULL, &flags); + sva = iommu_sva_bind_device(&idxd->pdev->dev, NULL, flags); if (IS_ERR(sva)) { dev_warn(&idxd->pdev->dev, "iommu sva bind failed: %ld\n", PTR_ERR(sva)); diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c index bb251ca..23e287e 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c @@ -354,7 +354,7 @@ __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm) } struct iommu_sva * -arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void *drvdata) +arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, unsigned int flags) { struct iommu_sva *handle; struct iommu_domain *domain = iommu_get_domain_for_dev(dev); diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index f985817..b971d4d 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -711,7 +711,7 @@ bool arm_smmu_master_sva_enabled(struct arm_smmu_master *master); int arm_smmu_master_enable_sva(struct arm_smmu_master *master); int arm_smmu_master_disable_sva(struct arm_smmu_master *master); struct iommu_sva *arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, - void *drvdata); + unsigned int flags); void arm_smmu_sva_unbind(struct iommu_sva *handle); u32 arm_smmu_sva_get_pasid(struct iommu_sva *handle); void arm_smmu_sva_notifier_synchronize(void); @@ -742,7 +742,7 @@ static inline int arm_smmu_master_disable_sva(struct arm_smmu_master *master) } static inline struct iommu_sva * -arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void *drvdata) +arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, unsigned int flags) { return ERR_PTR(-ENODEV); } diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig index 
28a3d15..5415052 100644 --- a/drivers/iommu/intel/Kconfig +++ b/drivers/iommu/intel/Kconfig @@ -41,6 +41,7 @@ config INTEL_IOMMU_SVM select PCI_PRI select MMU_NOTIFIER select IOASID + select IOMMU_SVA_LIB Why? help Shared Virtual Memory (SVM) provides a facility for devices to access DMA
Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage
Hi Longpeng, On 4/8/21 3:37 PM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: Hi Baolu, -Original Message- From: Lu Baolu [mailto:baolu...@linux.intel.com] Sent: Thursday, April 8, 2021 12:32 PM To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) ; io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org Cc: baolu...@linux.intel.com; David Woodhouse ; Nadav Amit ; Alex Williamson ; Kevin Tian ; Gonglei (Arei) ; sta...@vger.kernel.org Subject: Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage Hi Longpeng, On 4/7/21 2:35 PM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: Hi Baolu, -Original Message- From: Lu Baolu [mailto:baolu...@linux.intel.com] Sent: Friday, April 2, 2021 12:44 PM To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) ; io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org Cc: baolu...@linux.intel.com; David Woodhouse ; Nadav Amit ; Alex Williamson ; Kevin Tian ; Gonglei (Arei) ; sta...@vger.kernel.org Subject: Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage Hi Longpeng, On 4/1/21 3:18 PM, Longpeng(Mike) wrote: diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index ee09323..cbcb434 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain, * removed to make room for superpage(s). * We're adding new large pages, so make sure * we don't remove their parent tables. +* +* We also need to flush the iotlb before creating +* superpage to ensure it does not preserve any +* obsolete info. */ - dma_pte_free_pagetable(domain, iov_pfn, end_pfn, - largepage_lvl + 1); + if (dma_pte_present(pte)) { The dma_pte_free_pagetable() clears a batch of PTEs. So checking current PTE is insufficient. How about removing this check and always performing cache invalidation? Um...the PTE here may be present (e.g. 
4K mapping --> superpage mapping) or NOT-present (e.g. create a totally new superpage mapping), but we only need to call free_pagetable and flush_iotlb in the former case, right? But this code covers multiple PTEs and perhaps crosses the page boundary. How about moving this code into a separate function and checking PTE presence there. A sample code could look like below: [compiled but not tested!] diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index d334f5b4e382..0e04d450c38a 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -2300,6 +2300,41 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain, return level; } +/* + * Ensure that old small page tables are removed to make room for superpage(s). + * We're going to add new large pages, so make sure we don't remove their parent + * tables. The IOTLB/devTLBs should be flushed if any PDE/PTEs are cleared. + */ +static void switch_to_super_page(struct dmar_domain *domain, +unsigned long start_pfn, +unsigned long end_pfn, int level) { Maybe "switch_to" will lead people to think "remove old and then setup new", so how about something like "remove_room_for_super_page" or "prepare_for_super_page" ? I named it like this because we also want to have an opposite operation split_from_super_page() which switches a PDE or PDPE from a superpage setup back to small pages, which is needed to optimize dirty bit tracking during VM live migration. + unsigned long lvl_pages = lvl_to_nr_pages(level); + struct dma_pte *pte = NULL; + int i; + + while (start_pfn <= end_pfn) { start_pfn < end_pfn ? end_pfn is inclusive. 
+ if (!pte) + pte = pfn_to_dma_pte(domain, start_pfn, &level); + + if (dma_pte_present(pte)) { + dma_pte_free_pagetable(domain, start_pfn, + start_pfn + lvl_pages - 1, + level + 1); + + for_each_domain_iommu(i, domain) + iommu_flush_iotlb_psi(g_iommus[i], domain, + start_pfn, lvl_pages, + 0, 0); + } + + pte++; + start_pfn += lvl_pages; + if (first_pte_in_page(pte)) + pte = NULL; + } +} +
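The `start_pfn <= end_pfn` question raised inline above comes down to end_pfn being inclusive. A minimal userspace sketch of that loop's shape (no kernel code involved) makes the off-by-one visible: counting the level-sized chunks the loop visits, an exclusive `<` comparison would skip a range whose last chunk starts exactly at end_pfn.

```c
#include <assert.h>

/* The loop above iterates with "start_pfn <= end_pfn" because end_pfn
 * is inclusive. This sketch counts the level-sized chunks such a loop
 * visits; with "<" instead, a chunk starting exactly at end_pfn would
 * be skipped. */
static unsigned long count_chunks(unsigned long start_pfn,
				  unsigned long end_pfn, /* inclusive */
				  unsigned long lvl_pages)
{
	unsigned long n = 0;

	while (start_pfn <= end_pfn) {
		n++;
		start_pfn += lvl_pages;
	}
	return n;
}
```

For a 2M level (512 pages of 4K), the inclusive range [0, 512] spans 513 pfns and therefore two chunks; a `<` comparison would only count one.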
Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage
Hi Longpeng, On 4/7/21 2:35 PM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: Hi Baolu, -Original Message- From: Lu Baolu [mailto:baolu...@linux.intel.com] Sent: Friday, April 2, 2021 12:44 PM To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) ; io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org Cc: baolu...@linux.intel.com; David Woodhouse ; Nadav Amit ; Alex Williamson ; Kevin Tian ; Gonglei (Arei) ; sta...@vger.kernel.org Subject: Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage Hi Longpeng, On 4/1/21 3:18 PM, Longpeng(Mike) wrote: diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index ee09323..cbcb434 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain, * removed to make room for superpage(s). * We're adding new large pages, so make sure * we don't remove their parent tables. +* +* We also need to flush the iotlb before creating +* superpage to ensure it does not preserve any +* obsolete info. */ - dma_pte_free_pagetable(domain, iov_pfn, end_pfn, - largepage_lvl + 1); + if (dma_pte_present(pte)) { The dma_pte_free_pagetable() clears a batch of PTEs. So checking current PTE is insufficient. How about removing this check and always performing cache invalidation? Um...the PTE here may be present (e.g. 4K mapping --> superpage mapping) or NOT-present (e.g. create a totally new superpage mapping), but we only need to call free_pagetable and flush_iotlb in the former case, right? But this code covers multiple PTEs and perhaps crosses the page boundary. How about moving this code into a separate function and checking PTE presence there. A sample code could look like below: [compiled but not tested!] 
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index d334f5b4e382..0e04d450c38a 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -2300,6 +2300,41 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain, return level; } +/* + * Ensure that old small page tables are removed to make room for superpage(s). + * We're going to add new large pages, so make sure we don't remove their parent + * tables. The IOTLB/devTLBs should be flushed if any PDE/PTEs are cleared. + */ +static void switch_to_super_page(struct dmar_domain *domain, +unsigned long start_pfn, +unsigned long end_pfn, int level) +{ + unsigned long lvl_pages = lvl_to_nr_pages(level); + struct dma_pte *pte = NULL; + int i; + + while (start_pfn <= end_pfn) { + if (!pte) + pte = pfn_to_dma_pte(domain, start_pfn, &level); + + if (dma_pte_present(pte)) { + dma_pte_free_pagetable(domain, start_pfn, + start_pfn + lvl_pages - 1, + level + 1); + + for_each_domain_iommu(i, domain) + iommu_flush_iotlb_psi(g_iommus[i], domain, + start_pfn, lvl_pages, + 0, 0); + } + + pte++; + start_pfn += lvl_pages; + if (first_pte_in_page(pte)) + pte = NULL; + } +} + static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn, unsigned long phys_pfn, unsigned long nr_pages, int prot) @@ -2341,22 +2376,11 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn, return -ENOMEM; /* It is large page*/ if (largepage_lvl > 1) { - unsigned long nr_superpages, end_pfn; + unsigned long end_pfn; pteval |= DMA_PTE_LARGE_PAGE; - lvl_pages = lvl_to_nr_pages(largepage_lvl); - - nr_superpages = nr_pages / lvl_pages; - end_pfn = iov_pfn + nr_superpages * lvl_pages - 1; - - /* -* Ensure that old small page tables are -* removed to make room for superpage(s). -* We're adding new large pages, so make sure -* we don't remove their parent tables.
Re: [PATCH 5.4 v2 1/1] iommu/vt-d: Fix agaw for a supported 48 bit guest address width
On 4/8/21 2:40 AM, Saeed Mirzamohammadi wrote: The IOMMU driver calculates the guest addressability for a DMA request based on the value of the mgaw reported from the IOMMU. However, this is a fused value and as mentioned in the spec, the guest width should be calculated based on the minimum of supported adjusted guest address width (SAGAW) and MGAW. This is from specification: "Guest addressability for a given DMA request is limited to the minimum of the value reported through this field and the adjusted guest address width of the corresponding page-table structure. (Adjusted guest address widths supported by hardware are reported through the SAGAW field)." This causes domain initialization to fail and following errors appear for EHCI PCI driver: [ 2.486393] ehci-pci 0000:01:00.4: EHCI Host Controller [ 2.486624] ehci-pci 0000:01:00.4: new USB bus registered, assigned bus number 1 [ 2.489127] ehci-pci 0000:01:00.4: DMAR: Allocating domain failed [ 2.489350] ehci-pci 0000:01:00.4: DMAR: 32bit DMA uses non-identity mapping [ 2.489359] ehci-pci 0000:01:00.4: can't setup: -12 [ 2.489531] ehci-pci 0000:01:00.4: USB bus 1 deregistered [ 2.490023] ehci-pci 0000:01:00.4: init 0000:01:00.4 fail, -12 [ 2.490358] ehci-pci: probe of 0000:01:00.4 failed with error -12 This issue happens when the value of the sagaw corresponds to a 48-bit agaw. This fix updates the calculation of the agaw based on the minimum of IOMMU's sagaw value and MGAW. Signed-off-by: Saeed Mirzamohammadi Tested-by: Camille Lu --- Change in v2: - Added cap_width to calculate AGAW based on the minimum value of MGAW and AGAW. 
--- drivers/iommu/intel-iommu.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 953d86ca6d2b..a2a03df97704 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -1853,7 +1853,7 @@ static inline int guestwidth_to_adjustwidth(int gaw) static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu, int guest_width) { - int adjust_width, agaw; + int adjust_width, agaw, cap_width; unsigned long sagaw; int err; @@ -1867,8 +1867,9 @@ static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu, domain_reserve_special_ranges(domain); /* calculate AGAW */ - if (guest_width > cap_mgaw(iommu->cap)) - guest_width = cap_mgaw(iommu->cap); + cap_width = min_t(int, cap_mgaw(iommu->cap), agaw_to_width(iommu->agaw)); + if (guest_width > cap_width) + guest_width = cap_width; domain->gaw = guest_width; adjust_width = guestwidth_to_adjustwidth(guest_width); agaw = width_to_agaw(adjust_width); Reviewed-by: Lu Baolu Best regards, baolu
Re: [PATCH v8 7/9] vfio/mdev: Add iommu related member in mdev_device
Hi Jason, On 4/7/21 4:00 AM, Jason Gunthorpe wrote: On Mon, Mar 25, 2019 at 09:30:34AM +0800, Lu Baolu wrote: A parent device might create different types of mediated devices. For example, a mediated device could be created by the parent device with full isolation and protection provided by the IOMMU. One usage case could be found on Intel platforms where a mediated device is an assignable subset of a PCI, the DMA requests on behalf of it are all tagged with a PASID. Since IOMMU supports PASID-granular translations (scalable mode in VT-d 3.0), this mediated device could be individually protected and isolated by an IOMMU. This patch adds a new member in the struct mdev_device to indicate that the mediated device represented by mdev could be isolated and protected by attaching a domain to a device represented by mdev->iommu_device. It also adds a helper to add or set the iommu device. * mdev_device->iommu_device - This, if set, indicates that the mediated device could be fully isolated and protected by IOMMU via attaching an iommu domain to this device. If empty, it indicates using vendor defined isolation, hence bypass IOMMU. * mdev_set/get_iommu_device(dev, iommu_device) - Set or get the iommu device which represents this mdev in IOMMU's device scope. Drivers don't need to set the iommu device if it uses vendor defined isolation. 
Cc: Ashok Raj Cc: Jacob Pan Cc: Kevin Tian Cc: Liu Yi L Suggested-by: Kevin Tian Suggested-by: Alex Williamson Signed-off-by: Lu Baolu Reviewed-by: Jean-Philippe Brucker --- drivers/vfio/mdev/mdev_core.c | 18 ++ drivers/vfio/mdev/mdev_private.h | 1 + include/linux/mdev.h | 14 ++ 3 files changed, 33 insertions(+) diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c index b96fedc77ee5..1b6435529166 100644 +++ b/drivers/vfio/mdev/mdev_core.c @@ -390,6 +390,24 @@ int mdev_device_remove(struct device *dev, bool force_remove) return 0; } +int mdev_set_iommu_device(struct device *dev, struct device *iommu_device) +{ + struct mdev_device *mdev = to_mdev_device(dev); + + mdev->iommu_device = iommu_device; + + return 0; +} +EXPORT_SYMBOL(mdev_set_iommu_device); I was looking at these functions when touching the mdev stuff and I have some concerns. 1) Please don't merge dead code. It is a year later and there is still no in-tree user for any of this. This is not our process. Even worse it was exported so it looks like this dead code is supporting out of tree modules. 2) Why is this like this? Every struct device already has a connection to the iommu layer and every mdev has a struct device all its own. Why did we need to add special 'if (mdev)' stuff all over the place? This smells like the same abuse Thomas and I pointed out for the interrupt domains. I once tried to implement bus iommu_ops for mdev devices. https://lore.kernel.org/lkml/20201030045809.957927-1-baolu...@linux.intel.com/ Any comments? Best regards, baolu After my next series the mdev drivers will have direct access to the vfio_device. So an alternative to using the struct device, or adding 'if mdev' is to add an API to the vfio_device world to inject what iommu configuration is needed from that direction instead of trying to discover it from a struct device. 
3) The vfio_bus_is_mdev() and related symbol_get() nonsense in drivers/vfio/vfio_iommu_type1.c has to go, for the same reasons it was not acceptable to do this for the interrupt side either. 4) It seems pretty clear to me this will be heavily impacted by the /dev/ioasid discussion. Please consider removing the dead code now. Basically, please fix this before trying to get idxd mdev merged as the first user. Jason
Re: [PATCH v5.4 1/1] iommu/vt-d: Fix agaw for a supported 48 bit guest address width
Hi Saeed, On 4/7/21 12:35 AM, Saeed Mirzamohammadi wrote: The IOMMU driver calculates the guest addressability for a DMA request based on the value of the mgaw reported from the IOMMU. However, this is a fused value and as mentioned in the spec, the guest width should be calculated based on the supported adjusted guest address width (SAGAW). This is from specification: "Guest addressability for a given DMA request is limited to the minimum of the value reported through this field and the adjusted guest address width of the corresponding page-table structure. (Adjusted guest address widths supported by hardware are reported through the SAGAW field)." This causes domain initialization to fail and following errors appear for EHCI PCI driver: [ 2.486393] ehci-pci 0000:01:00.4: EHCI Host Controller [ 2.486624] ehci-pci 0000:01:00.4: new USB bus registered, assigned bus number 1 [ 2.489127] ehci-pci 0000:01:00.4: DMAR: Allocating domain failed [ 2.489350] ehci-pci 0000:01:00.4: DMAR: 32bit DMA uses non-identity mapping [ 2.489359] ehci-pci 0000:01:00.4: can't setup: -12 [ 2.489531] ehci-pci 0000:01:00.4: USB bus 1 deregistered [ 2.490023] ehci-pci 0000:01:00.4: init 0000:01:00.4 fail, -12 [ 2.490358] ehci-pci: probe of 0000:01:00.4 failed with error -12 This issue happens when the value of the sagaw corresponds to a 48-bit agaw. This fix updates the calculation of the agaw based on the IOMMU's sagaw value. 
Signed-off-by: Saeed Mirzamohammadi Tested-by: Camille Lu --- drivers/iommu/intel-iommu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 953d86ca6d2b..396e14fad54b 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -1867,8 +1867,8 @@ static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu, domain_reserve_special_ranges(domain); /* calculate AGAW */ - if (guest_width > cap_mgaw(iommu->cap)) - guest_width = cap_mgaw(iommu->cap); + if (guest_width > agaw_to_width(iommu->agaw)) + guest_width = agaw_to_width(iommu->agaw); The spec requires using the minimum of MGAW and AGAW, so why not: cap_width = min_t(int, cap_mgaw(iommu->cap), agaw_to_width(iommu->agaw)); if (guest_width > cap_width) guest_width = cap_width; Best regards, baolu domain->gaw = guest_width; adjust_width = guestwidth_to_adjustwidth(guest_width); agaw = width_to_agaw(adjust_width);
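The clamp suggested in the review above is a pure min-and-cap computation, so it can be checked in isolation. This is a userspace sketch of the suggested logic only (the function name is hypothetical; widths are in bits, e.g. MGAW-derived 57 vs. SAGAW-derived 48):

```c
#include <assert.h>

/* Sketch of the suggested min_t() clamp: the usable guest address width
 * is bounded by both MGAW and the width implied by the supported AGAW,
 * so the requested width is clamped to the smaller of the two. */
static int clamp_guest_width(int guest_width, int mgaw_width, int agaw_width)
{
	int cap_width = mgaw_width < agaw_width ? mgaw_width : agaw_width;

	return guest_width > cap_width ? cap_width : guest_width;
}
```

This is exactly the failure mode in the bug report: with MGAW fused wider than what SAGAW supports, clamping against MGAW alone leaves guest_width too large and domain initialization fails.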
Re: [PATCH v2 4/5] iommu/vt-d: Use user privilege for RID2PASID translation
On 3/20/21 10:54 AM, Lu Baolu wrote: When first-level page tables are used for IOVA translation, we use user privilege by setting U/S bit in the page table entry. This is to make it consistent with the second level translation, where the U/S enforcement is not available. Clear the SRE (Supervisor Request Enable) field in the pasid table entry of RID2PASID so that requests requesting the supervisor privilege are blocked and treated as DMA remapping faults. Suggested-by: Jacob Pan Fixes: b802d070a52a1 ("iommu/vt-d: Use iova over first level") Signed-off-by: Lu Baolu We found some devices still require SRE to be set during internal tests. I will drop this patch from my queue for v5.13 for now. Best regards, baolu --- drivers/iommu/intel/iommu.c | 7 +-- drivers/iommu/intel/pasid.c | 3 ++- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 868f195f55ff..7354f9ce47d8 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -2494,9 +2494,9 @@ static int domain_setup_first_level(struct intel_iommu *iommu, struct device *dev, u32 pasid) { - int flags = PASID_FLAG_SUPERVISOR_MODE; struct dma_pte *pgd = domain->pgd; int agaw, level; + int flags = 0; /* * Skip top levels of page tables for iommu which has @@ -2512,7 +2512,10 @@ static int domain_setup_first_level(struct intel_iommu *iommu, if (level != 4 && level != 5) return -EINVAL; - flags |= (level == 5) ? 
PASID_FLAG_FL5LP : 0; + if (pasid != PASID_RID2PASID) + flags |= PASID_FLAG_SUPERVISOR_MODE; + if (level == 5) + flags |= PASID_FLAG_FL5LP; return intel_pasid_setup_first_level(iommu, dev, (pgd_t *)pgd, pasid, domain->iommu_did[iommu->seq_id], diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 0bf7e0a76890..dd69df5a188a 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -673,7 +673,8 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu, * Since it is a second level only translation setup, we should * set SRE bit as well (addresses are expected to be GPAs). */ - pasid_set_sre(pte); + if (pasid != PASID_RID2PASID) + pasid_set_sre(pte); pasid_set_present(pte); pasid_flush_caches(iommu, pte, pasid, did);
Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage
On 4/2/21 11:41 AM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: Hi Baolu, On 2021/4/2 11:06, Lu Baolu wrote: Hi Longpeng, On 4/1/21 3:18 PM, Longpeng(Mike) wrote: The translation caches may preserve obsolete data when the mapping size is changed, suppose the following sequence which can reveal the problem with high probability. 1. mmap(4GB,MAP_HUGETLB) 2. while (1) { (a) DMA MAP 0,0xa (b) DMA UNMAP 0,0xa (c) DMA MAP 0,0xc000 * DMA read IOVA 0 may fail here (Not present) * if the problem occurs. (d) DMA UNMAP 0,0xc000 } The page table (focusing only on IOVA 0) after (a) is: PML4: 0x19db5c1003 entry:0x899bdcd2f000 PDPE: 0x1a1cacb003 entry:0x89b35b5c1000 PDE: 0x1a30a72003 entry:0x89b39cacb000 PTE: 0x21d200803 entry:0x89b3b0a72000 The page table after (b) is: PML4: 0x19db5c1003 entry:0x899bdcd2f000 PDPE: 0x1a1cacb003 entry:0x89b35b5c1000 PDE: 0x1a30a72003 entry:0x89b39cacb000 PTE: 0x0 entry:0x89b3b0a72000 The page table after (c) is: PML4: 0x19db5c1003 entry:0x899bdcd2f000 PDPE: 0x1a1cacb003 entry:0x89b35b5c1000 PDE: 0x21d200883 entry:0x89b39cacb000 (*) Because the PDE entry after (b) is present, it won't be flushed even if the iommu driver flushes the cache when unmapping, so the obsolete data may be preserved in the cache, which would cause a wrong translation in the end. However, we can see the PDE entry is finally switched to a 2M-superpage mapping, but it does not transform to 0x21d200883 directly: 1. PDE: 0x1a30a72003 2. __domain_mapping dma_pte_free_pagetable Set the PDE entry to ZERO Set the PDE entry to 0x21d200883 So we must flush the cache after the entry switches to ZERO to avoid the obsolete info being preserved. 
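The staleness sequence described above can be modeled with a one-entry cache in plain userspace C. This sketch is illustrative only (structure and function names are hypothetical, and the PDE values reuse the ones from the trace above): once a lookup has cached the old page-directory entry, rewriting memory alone does not help; only an invalidation between the clear and the superpage install makes the next lookup see the new entry.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the staleness described above: a one-entry "IOTLB"
 * caching a page-directory entry. If the in-memory PDE is cleared and
 * rewritten without an invalidation in between, lookups keep serving
 * the cached, now-obsolete value -- which is why the flush must happen
 * after the entry is zeroed and before the superpage entry is used. */
struct toy_iotlb {
	int valid;
	uint64_t cached_pde;
};

static uint64_t lookup(struct toy_iotlb *tlb, const uint64_t *pde)
{
	if (!tlb->valid) {		/* miss: walk "memory", then cache */
		tlb->cached_pde = *pde;
		tlb->valid = 1;
	}
	return tlb->cached_pde;		/* hit: memory is never consulted */
}

static void flush(struct toy_iotlb *tlb)
{
	tlb->valid = 0;			/* models iommu_flush_iotlb_psi() */
}
```

Running the (a)-(b)-(c) sequence from the commit message against this model shows the wrong translation exactly when no flush separates the PDE clear from the superpage install.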
Cc: David Woodhouse Cc: Lu Baolu Cc: Nadav Amit Cc: Alex Williamson Cc: Kevin Tian Cc: Gonglei (Arei) Fixes: 6491d4d02893 ("intel-iommu: Free old page tables before creating superpage") Cc: # v3.0+ Link: https://lore.kernel.org/linux-iommu/670baaf8-4ff8-4e84-4be3-030b95ab5...@huawei.com/ Suggested-by: Lu Baolu Signed-off-by: Longpeng(Mike) --- drivers/iommu/intel/iommu.c | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index ee09323..cbcb434 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain, * removed to make room for superpage(s). * We're adding new large pages, so make sure * we don't remove their parent tables. + * + * We also need to flush the iotlb before creating + * superpage to ensure it does not perserves any + * obsolete info. */ - dma_pte_free_pagetable(domain, iov_pfn, end_pfn, - largepage_lvl + 1); + if (dma_pte_present(pte)) { + int i; + + dma_pte_free_pagetable(domain, iov_pfn, end_pfn, + largepage_lvl + 1); + for_each_domain_iommu(i, domain) + iommu_flush_iotlb_psi(g_iommus[i], domain, + iov_pfn, nr_pages, 0, 0); Thanks for patch! How about making the flushed page size accurate? For example, @@ -2365,8 +2365,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn, dma_pte_free_pagetable(domain, iov_pfn, end_pfn, largepage_lvl + 1); for_each_domain_iommu(i, domain) - iommu_flush_iotlb_psi(g_iommus[i], domain, - iov_pfn, nr_pages, 0, 0); + iommu_flush_iotlb_psi(g_iommus[i], domain, iov_pfn, + ALIGN_DOWN(nr_pages, lvl_pages), 0, 0); Yes, make sense. Maybe another alternative is 'end_pfn - iova_pfn + 1', it's readable because we free pagetable with (iova_pfn, end_pfn) above. Which one do you prefer? Yours looks better. By the way, if you are willing to prepare a v2, please make sure to add Joerg (IOMMU subsystem maintainer) to the list. 
Best regards, baolu
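The ordering bug discussed in this thread can be reduced to a toy model. The sketch below is illustrative only, not driver code: the constants mirror the page-table dump in the commit message, and a single cached PDE stands in for the IOMMU's paging-structure caches. It shows that installing the superpage leaf without a flush after the old PDE is cleared lets a stale cached PDE walk the freed (now zeroed) page table, producing a not-present fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PDE_LARGE   (1ULL << 7)          /* 2M superpage leaf bit */

static uint64_t pt[512];                 /* the 4K page-table page */
static uint64_t pde;                     /* the PDE in memory */
static uint64_t cached_pde;              /* what the IOMMU has cached */
static bool     cache_valid;

static void hw_cache_pde(void) { cached_pde = pde; cache_valid = true; }
static void flush_caches(void) { cache_valid = false; }

/* Hardware walk for IOVA 0: prefer the cached PDE, as real caches do. */
static uint64_t translate_iova0(void)
{
	uint64_t e = cache_valid ? cached_pde : pde;

	if (!e)
		return 0;                        /* not present -> DMA fault */
	if (e & PDE_LARGE)
		return e & ~((1ULL << 21) - 1);  /* 2M leaf */
	return pt[0] ? (pt[0] & ~0xFFFULL) : 0;
}

/* Replay steps (a)..(c) of the reproducer; returns the final translation. */
static uint64_t remap_as_superpage(bool flush_between)
{
	pt[0] = 0x21d200000ULL | 0x3;        /* (a) 4K mapping in place */
	pde   = 0x1a30a72000ULL | 0x3;       /* PDE points at pt */
	hw_cache_pde();                      /* hardware walked and cached it */

	/* (c) dma_pte_free_pagetable(): PDE cleared, old table freed */
	pde = 0;
	memset(pt, 0, sizeof(pt));
	if (flush_between)
		flush_caches();                  /* the flush the patch adds */

	pde = 0x21d200000ULL | 0x3 | PDE_LARGE;  /* install the 2M leaf */
	return translate_iova0();
}
```

Without the intermediate flush the stale PDE is consulted and the walk lands in the zeroed table; with it, the new 2M leaf translates correctly.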
Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage
Hi Longpeng, On 4/1/21 3:18 PM, Longpeng(Mike) wrote: diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index ee09323..cbcb434 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain, * removed to make room for superpage(s). * We're adding new large pages, so make sure * we don't remove their parent tables. +* +* We also need to flush the iotlb before creating +* superpage to ensure it does not preserve any +* obsolete info. */ - dma_pte_free_pagetable(domain, iov_pfn, end_pfn, - largepage_lvl + 1); + if (dma_pte_present(pte)) { The dma_pte_free_pagetable() clears a batch of PTEs, so checking the current PTE is insufficient. How about removing this check and always performing cache invalidation? + int i; + + dma_pte_free_pagetable(domain, iov_pfn, end_pfn, + largepage_lvl + 1); + for_each_domain_iommu(i, domain) + iommu_flush_iotlb_psi(g_iommus[i], domain, + iov_pfn, nr_pages, 0, 0); + Best regards, baolu
Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage
Hi Longpeng, On 4/1/21 3:18 PM, Longpeng(Mike) wrote: The translation caches may preserve obsolete data when the mapping size is changed, suppose the following sequence which can reveal the problem with high probability. 1.mmap(4GB,MAP_HUGETLB) 2. while (1) { (a)DMA MAP 0,0xa (b)DMA UNMAP 0,0xa (c)DMA MAP 0,0xc000 * DMA read IOVA 0 may failure here (Not present) * if the problem occurs. (d)DMA UNMAP 0,0xc000 } The page table(only focus on IOVA 0) after (a) is: PML4: 0x19db5c1003 entry:0x899bdcd2f000 PDPE: 0x1a1cacb003 entry:0x89b35b5c1000 PDE: 0x1a30a72003 entry:0x89b39cacb000 PTE: 0x21d200803 entry:0x89b3b0a72000 The page table after (b) is: PML4: 0x19db5c1003 entry:0x899bdcd2f000 PDPE: 0x1a1cacb003 entry:0x89b35b5c1000 PDE: 0x1a30a72003 entry:0x89b39cacb000 PTE: 0x0 entry:0x89b3b0a72000 The page table after (c) is: PML4: 0x19db5c1003 entry:0x899bdcd2f000 PDPE: 0x1a1cacb003 entry:0x89b35b5c1000 PDE: 0x21d200883 entry:0x89b39cacb000 (*) Because the PDE entry after (b) is present, it won't be flushed even if the iommu driver flush cache when unmap, so the obsolete data may be preserved in cache, which would cause the wrong translation at end. However, we can see the PDE entry is finally switch to 2M-superpage mapping, but it does not transform to 0x21d200883 directly: 1. PDE: 0x1a30a72003 2. __domain_mapping dma_pte_free_pagetable Set the PDE entry to ZERO Set the PDE entry to 0x21d200883 So we must flush the cache after the entry switch to ZERO to avoid the obsolete info be preserved. 
Cc: David Woodhouse Cc: Lu Baolu Cc: Nadav Amit Cc: Alex Williamson Cc: Kevin Tian Cc: Gonglei (Arei) Fixes: 6491d4d02893 ("intel-iommu: Free old page tables before creating superpage") Cc: # v3.0+ Link: https://lore.kernel.org/linux-iommu/670baaf8-4ff8-4e84-4be3-030b95ab5...@huawei.com/ Suggested-by: Lu Baolu Signed-off-by: Longpeng(Mike) --- drivers/iommu/intel/iommu.c | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index ee09323..cbcb434 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain, * removed to make room for superpage(s). * We're adding new large pages, so make sure * we don't remove their parent tables. +* +* We also need to flush the iotlb before creating +* superpage to ensure it does not perserves any +* obsolete info. */ - dma_pte_free_pagetable(domain, iov_pfn, end_pfn, - largepage_lvl + 1); + if (dma_pte_present(pte)) { + int i; + + dma_pte_free_pagetable(domain, iov_pfn, end_pfn, + largepage_lvl + 1); + for_each_domain_iommu(i, domain) + iommu_flush_iotlb_psi(g_iommus[i], domain, + iov_pfn, nr_pages, 0, 0); Thanks for patch! How about making the flushed page size accurate? For example, @@ -2365,8 +2365,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn, dma_pte_free_pagetable(domain, iov_pfn, end_pfn, largepage_lvl + 1); for_each_domain_iommu(i, domain) - iommu_flush_iotlb_psi(g_iommus[i], domain, - iov_pfn, nr_pages, 0, 0); + iommu_flush_iotlb_psi(g_iommus[i], domain, iov_pfn, + ALIGN_DOWN(nr_pages, lvl_pages), 0, 0); + } } else { pteval &= ~(uint64_t)DMA_PTE_LARGE_PAGE; } Best regards, baolu
[PATCH 1/1] iommu/vt-d: Report right snoop capability when using FL for IOVA
The Intel VT-d driver checks the wrong register to report snoop capability when using the first-level page table for GPA to HPA translation. This might lead the IOMMU driver to say that it supports snooping control, but in reality, it does not. Fix this by always setting PASID-table-entry.PGSNP whenever a pasid entry is set up for GPA to HPA translation, so that the IOMMU driver can report snoop capability as long as it runs in the scalable mode. Fixes: b802d070a52a1 ("iommu/vt-d: Use iova over first level") Suggested-by: Rajesh Sankaran Suggested-by: Kevin Tian Suggested-by: Ashok Raj Signed-off-by: Lu Baolu --- drivers/iommu/intel/pasid.h | 1 + drivers/iommu/intel/iommu.c | 11 ++- drivers/iommu/intel/pasid.c | 16 3 files changed, 27 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h index 079534fcf55d..5ff61c3d401f 100644 --- a/drivers/iommu/intel/pasid.h +++ b/drivers/iommu/intel/pasid.h @@ -48,6 +48,7 @@ */ #define PASID_FLAG_SUPERVISOR_MODE BIT(0) #define PASID_FLAG_NESTED BIT(1) +#define PASID_FLAG_PAGE_SNOOP BIT(2) /* * The PASID_FLAG_FL5LP flag Indicates using 5-level paging for first- diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 7354f9ce47d8..deaa87ad4e5f 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -657,7 +657,14 @@ static int domain_update_iommu_snooping(struct intel_iommu *skip) rcu_read_lock(); for_each_active_iommu(iommu, drhd) { if (iommu != skip) { - if (!ecap_sc_support(iommu->ecap)) { + /* +* If the hardware is operating in the scalable mode, +* the snooping control is always supported since we +* always set PASID-table-entry.PGSNP bit if the domain +* is managed outside (UNMANAGED). 
+*/ + if (!sm_supported(iommu) && + !ecap_sc_support(iommu->ecap)) { ret = 0; break; } @@ -2516,6 +2523,8 @@ static int domain_setup_first_level(struct intel_iommu *iommu, flags |= PASID_FLAG_SUPERVISOR_MODE; if (level == 5) flags |= PASID_FLAG_FL5LP; + if (domain->domain.type == IOMMU_DOMAIN_UNMANAGED) + flags |= PASID_FLAG_PAGE_SNOOP; return intel_pasid_setup_first_level(iommu, dev, (pgd_t *)pgd, pasid, domain->iommu_did[iommu->seq_id], diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index c896aef7db40..b901909da79e 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -425,6 +425,16 @@ static inline void pasid_set_page_snoop(struct pasid_entry *pe, bool value) pasid_set_bits(&pe->val[1], 1 << 23, value << 23); } +/* + * Setup the Page Snoop (PGSNP) field (Bit 88) of a scalable mode + * PASID entry. + */ +static inline void +pasid_set_pgsnp(struct pasid_entry *pe) +{ + pasid_set_bits(&pe->val[1], 1ULL << 24, 1ULL << 24); +} + /* * Setup the First Level Page table Pointer field (Bit 140~191) * of a scalable mode PASID entry. @@ -599,6 +609,9 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu, } } + if (flags & PASID_FLAG_PAGE_SNOOP) + pasid_set_pgsnp(pte); + pasid_set_domain_id(pte, did); pasid_set_address_width(pte, iommu->agaw); pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap)); @@ -677,6 +690,9 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu, pasid_set_fault_enable(pte); pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap)); + if (domain->domain.type == IOMMU_DOMAIN_UNMANAGED) + pasid_set_pgsnp(pte); + /* * Since it is a second level only translation setup, we should * set SRE bit as well (addresses are expected to be GPAs). -- 2.25.1
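A side note on the bit arithmetic in this patch: a scalable-mode PASID entry is 512 bits wide, stored as eight 64-bit words val[0..7], so a field documented by its absolute bit position lands in word bit / 64 at offset bit % 64. That is why PGSNP (bit 88) is set via val[1] with 1ULL << 24, while the adjacent page-snoop field set by pasid_set_page_snoop() (val[1], bit 23) sits at absolute bit 87. A minimal sketch with hypothetical helper names:

```c
#include <assert.h>

/*
 * Map an absolute PASID-entry bit position (0..511) to the index of the
 * 64-bit word holding it and the bit offset within that word.
 */
static int pasid_bit_word(int bit)   { return bit / 64; }
static int pasid_bit_offset(int bit) { return bit % 64; }
```

For example, pasid_bit_word(88) is 1 and pasid_bit_offset(88) is 24, matching the 1ULL << 24 in pasid_set_pgsnp() above.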
Re: A problem of Intel IOMMU hardware ?
Hi Nadav, On 3/27/21 12:36 PM, Nadav Amit wrote: On Mar 26, 2021, at 7:31 PM, Lu Baolu wrote: Hi Nadav, On 3/19/21 12:46 AM, Nadav Amit wrote: So here is my guess: Intel probably used as a basis for the IOTLB an implementation of some other (regular) TLB design. Intel SDM says regarding TLBs (4.10.4.2 “Recommended Invalidation”): "Software wishing to prevent this uncertainty should not write to a paging-structure entry in a way that would change, for any linear address, both the page size and either the page frame, access rights, or other attributes.” Now the aforementioned uncertainty is a bit different (multiple *valid* translations of a single address). Yet, perhaps this is yet another thing that might happen. From a brief look on the handling of MMU (not IOMMU) hugepages in Linux, indeed the PMD is first cleared and flushed before a new valid PMD is set. This is possible for MMUs since they allow the software to handle spurious page-faults gracefully. This is not the case for the IOMMU though (without PRI). Not sure this explains everything though. If that is the problem, then during a mapping that changes page-sizes, a TLB flush is needed, similarly to the one Longpeng did manually. I have been working with Longpeng on this issue these days. It turned out that your guess is right. The PMD is first cleared but not flushed before a new valid one is set. The previous entry might be cached in the paging structure caches hence leads to disaster. In __domain_mapping(): 2352 /* 2353 * Ensure that old small page tables are 2354 * removed to make room for superpage(s). 2355 * We're adding new large pages, so make sure 2356 * we don't remove their parent tables. 2357 */ 2358 dma_pte_free_pagetable(domain, iov_pfn, end_pfn, 2359 largepage_lvl + 1); I guess adding a cache flush operation after PMD switching should solve the problem. I am still not clear about this comment: " This is possible for MMUs since they allow the software to handle spurious page-faults gracefully. 
This is not the case for the IOMMU though (without PRI). " Can you please shed more light on this? I was looking at the code in more detail, and apparently my concern is incorrect. I was under the assumption that the IOMMU map/unmap can merge/split (specifically split) huge-pages. For instance, if you map 2MB and then unmap 4KB out of the 2MB, then you would split the hugepage and keep the rest of the mappings alive. This is the way MMU is usually managed. To my defense, I also saw such partial unmappings in Longpeng’s first scenario. If this was possible, then you would have a case in which out of 2MB (for instance), 4KB were unmapped, and you need to split the 2MB hugepage into 4KB pages. If you try to clear the PMD, flush, and then set the PMD to point to table with live 4KB PTES, you can have an interim state in which the PMD is not present. DMAs that arrive at this stage might fault, and without PRI (and device support) you do not have a way of restarting the DMA after the hugepage split is completed. Get you and thanks a lot for sharing. For current IOMMU usage, I can't see any case to split a huge page into 4KB pages, but in the near future, we do have a need of splitting huge pages. For example, when we want to use the A/D bit to track the DMA dirty pages during VM migration, it's an optimization if we could split a huge page into 4K ones. So far, the solution I have considered is: 1) Prepare the split subtables in advance; [it's identical to the existing one only use 4k pages instead of huge page.] 2) Switch the super (huge) page's leaf entry; [at this point, hardware could use both subtables. I am not sure whether the hardware allows a dynamic switch of page table entry from on valid entry to another valid one.] 3) Flush the cache. [hardware will use the new subtable] As long as we can make sure that the old subtable won't be used by hardware, we can safely release the old table. Anyhow, this concern is apparently not relevant. 
I guess I was too naive to assume the IOMMU management is similar to the MMU. I now see that there is a comment in intel_iommu_unmap() saying: /* Cope with horrid API which requires us to unmap more than the size argument if it happens to be a large-page mapping. */ Regards, Nadav Best regards, baolu
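The prepare/switch/flush split sequence sketched in this exchange can be mocked up to check the key invariant: a 4K subtable built from a 2M leaf must translate every offset to the same physical address. This is a toy model with hypothetical names; it deliberately ignores the open question of whether hardware tolerates a dynamic valid-to-valid leaf switch, and elides the real table pointer in the swapped entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PDE_LARGE  (1ULL << 7)
#define SZ_4K      4096ULL
#define SZ_2M      (512 * SZ_4K)

/* Step 1: prepare a 4K subtable identical in reach to the 2M leaf. */
static void prepare_subtable(uint64_t leaf, uint64_t *sub)
{
	uint64_t base  = leaf & ~(SZ_2M - 1);
	uint64_t flags = (leaf & 0xFFFULL) & ~PDE_LARGE;
	int i;

	for (i = 0; i < 512; i++)
		sub[i] = (base + i * SZ_4K) | flags;
}

/* The walk: either a 2M leaf, or a 4K entry picked from the subtable. */
static uint64_t translate(uint64_t pde, const uint64_t *sub, uint64_t off)
{
	if (pde & PDE_LARGE)
		return (pde & ~(SZ_2M - 1)) + off;
	return (sub[off / SZ_4K] & ~(SZ_4K - 1)) + off % SZ_4K;
}

/* Steps 1-3: the same IOVA must resolve to the same PA across the split. */
static bool split_preserves_translation(uint64_t off)
{
	static uint64_t sub[512];
	uint64_t pde = 0x21d200000ULL | 0x3 | PDE_LARGE;
	uint64_t before = translate(pde, sub, off);

	prepare_subtable(pde, sub);  /* 1: build the shadow subtable */
	pde = 0x3;                   /* 2: switch the leaf (pointer to sub elided) */
	/* 3: flush_caches() would go here so hardware drops the old leaf */
	return translate(pde, sub, off) == before;
}
```

If the hardware does not allow a valid-to-valid entry switch, step 2 would have to fall back to break-before-make, which reintroduces the transient not-present window discussed above.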
Re: A problem of Intel IOMMU hardware ?
Hi Nadav, On 3/19/21 12:46 AM, Nadav Amit wrote: So here is my guess: Intel probably used as a basis for the IOTLB an implementation of some other (regular) TLB design. Intel SDM says regarding TLBs (4.10.4.2 “Recommended Invalidation”): "Software wishing to prevent this uncertainty should not write to a paging-structure entry in a way that would change, for any linear address, both the page size and either the page frame, access rights, or other attributes.” Now the aforementioned uncertainty is a bit different (multiple *valid* translations of a single address). Yet, perhaps this is yet another thing that might happen. From a brief look on the handling of MMU (not IOMMU) hugepages in Linux, indeed the PMD is first cleared and flushed before a new valid PMD is set. This is possible for MMUs since they allow the software to handle spurious page-faults gracefully. This is not the case for the IOMMU though (without PRI). Not sure this explains everything though. If that is the problem, then during a mapping that changes page-sizes, a TLB flush is needed, similarly to the one Longpeng did manually. I have been working with Longpeng on this issue these days. It turned out that your guess is right. The PMD is first cleared but not flushed before a new valid one is set. The previous entry might be cached in the paging structure caches hence leads to disaster. In __domain_mapping(): 2352 /* 2353 * Ensure that old small page tables are 2354 * removed to make room for superpage(s). 2355 * We're adding new large pages, so make sure 2356 * we don't remove their parent tables. 2357 */ 2358 dma_pte_free_pagetable(domain, iov_pfn, end_pfn, 2359 largepage_lvl + 1); I guess adding a cache flush operation after PMD switching should solve the problem. I am still not clear about this comment: " This is possible for MMUs since they allow the software to handle spurious page-faults gracefully. This is not the case for the IOMMU though (without PRI). 
" Can you please shed more light on this? Best regards, baolu
Re: [PATCH 3/5] iommu/vt-d: Remove SVM_FLAG_PRIVATE_PASID
Hi Christoph, On 3/24/21 1:33 AM, Christoph Hellwig wrote: On Tue, Mar 23, 2021 at 09:05:58AM +0800, Lu Baolu wrote: The SVM_FLAG_PRIVATE_PASID has never been referenced in the tree, and there's no plan to have anything to use it. So cleanup it. Signed-off-by: Lu Baolu Looks good, Reviewed-by: Christoph Hellwig Thank you! But can we take this a little further? SVM_FLAG_GUEST_PASID is unused as well. SVM_FLAG_GUEST_MODE is only used in drivers/iommu/intel/svm.c, and SVM_FLAG_SUPERVISOR_MODE is actually used as an API flag to iommu_sva_bind_device. So IMHO the latter should be elevated to an IOMMU API level flag, and then include/linux/intel-svm.h can go away entirely or at least be moved to drivers/iommu/intel/. Sure. I will consider it and make it in separated patches. Best regards, baolu
Re: [PATCH 2/3] iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining
On 3/1/21 8:12 PM, John Garry wrote: Now that the core code handles flushing per-IOVA domain CPU rcaches, remove the handling here. Signed-off-by: John Garry --- drivers/iommu/intel/iommu.c | 31 --- include/linux/cpuhotplug.h | 1 - 2 files changed, 32 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index ee0932307d64..d1e66e1b07b8 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -4065,35 +4065,6 @@ static struct notifier_block intel_iommu_memory_nb = { .priority = 0 }; -static void free_all_cpu_cached_iovas(unsigned int cpu) -{ - int i; - - for (i = 0; i < g_num_of_iommus; i++) { - struct intel_iommu *iommu = g_iommus[i]; - struct dmar_domain *domain; - int did; - - if (!iommu) - continue; - - for (did = 0; did < cap_ndoms(iommu->cap); did++) { - domain = get_iommu_domain(iommu, (u16)did); - - if (!domain || domain->domain.type != IOMMU_DOMAIN_DMA) - continue; - - iommu_dma_free_cpu_cached_iovas(cpu, &domain->domain); - } - } -} - -static int intel_iommu_cpu_dead(unsigned int cpu) -{ - free_all_cpu_cached_iovas(cpu); - return 0; -} - static void intel_disable_iommus(void) { struct intel_iommu *iommu = NULL; @@ -4388,8 +4359,6 @@ int __init intel_iommu_init(void) bus_set_iommu(&pci_bus_type, &intel_iommu_ops); if (si_domain && !hw_pass_through) register_memory_notifier(&intel_iommu_memory_nb); - cpuhp_setup_state(CPUHP_IOMMU_INTEL_DEAD, "iommu/intel:dead", NULL, - intel_iommu_cpu_dead); down_read(&dmar_global_lock); if (probe_acpi_namespace_devices()) diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index cedac9986557..85996494bec1 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -57,7 +57,6 @@ enum cpuhp_state { CPUHP_PAGE_ALLOC_DEAD, CPUHP_NET_DEV_DEAD, CPUHP_PCI_XGENE_DEAD, - CPUHP_IOMMU_INTEL_DEAD, CPUHP_IOMMU_IOVA_DEAD, CPUHP_LUSTRE_CFS_DEAD, CPUHP_AP_ARM_CACHE_B15_RAC_DEAD, Reviewed-by: Lu Baolu Best regards, baolu
[PATCH 5/5] iommu/vt-d: Make unnecessarily global functions static
Make some functions static as they are only used inside pasid.c. Signed-off-by: Lu Baolu --- drivers/iommu/intel/pasid.c | 4 ++-- drivers/iommu/intel/pasid.h | 2 -- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index f2c747e62c6a..c896aef7db40 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -230,7 +230,7 @@ struct pasid_table *intel_pasid_get_table(struct device *dev) return info->pasid_table; } -int intel_pasid_get_dev_max_id(struct device *dev) +static int intel_pasid_get_dev_max_id(struct device *dev) { struct device_domain_info *info; @@ -241,7 +241,7 @@ int intel_pasid_get_dev_max_id(struct device *dev) return info->pasid_table->max_pasid; } -struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid) +static struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid) { struct device_domain_info *info; struct pasid_table *pasid_table; diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h index 90a3268d7a77..079534fcf55d 100644 --- a/drivers/iommu/intel/pasid.h +++ b/drivers/iommu/intel/pasid.h @@ -102,8 +102,6 @@ extern unsigned int intel_pasid_max_id; int intel_pasid_alloc_table(struct device *dev); void intel_pasid_free_table(struct device *dev); struct pasid_table *intel_pasid_get_table(struct device *dev); -int intel_pasid_get_dev_max_id(struct device *dev); -struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid); int intel_pasid_setup_first_level(struct intel_iommu *iommu, struct device *dev, pgd_t *pgd, u32 pasid, u16 did, int flags); -- 2.25.1
[PATCH 2/5] iommu/vt-d: Remove svm_dev_ops
The svm_dev_ops has never been referenced in the tree, and there's no plan to have anything to use it. Remove it to make the code neat. Signed-off-by: Lu Baolu --- drivers/iommu/intel/svm.c | 15 +-- include/linux/intel-iommu.h | 3 --- include/linux/intel-svm.h | 7 --- 3 files changed, 1 insertion(+), 24 deletions(-) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index 5d590d63ab52..875ee74d8c41 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -462,7 +462,6 @@ static void load_pasid(struct mm_struct *mm, u32 pasid) /* Caller must hold pasid_mutex, mm reference */ static int intel_svm_bind_mm(struct device *dev, unsigned int flags, - struct svm_dev_ops *ops, struct mm_struct *mm, struct intel_svm_dev **sd) { struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL); @@ -512,10 +511,6 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags, /* Find the matching device in svm list */ for_each_svm_dev(sdev, svm, dev) { - if (sdev->ops != ops) { - ret = -EBUSY; - goto out; - } sdev->users++; goto success; } @@ -550,7 +545,6 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags, /* Finish the setup now we know we're keeping it */ sdev->users = 1; - sdev->ops = ops; init_rcu_head(>rcu); if (!svm) { @@ -1006,13 +1000,6 @@ static irqreturn_t prq_event_thread(int irq, void *d) mmap_read_unlock(svm->mm); mmput(svm->mm); bad_req: - WARN_ON(!sdev); - if (sdev && sdev->ops && sdev->ops->fault_cb) { - int rwxp = (req->rd_req << 3) | (req->wr_req << 2) | - (req->exe_req << 1) | (req->pm_req); - sdev->ops->fault_cb(sdev->dev, req->pasid, req->addr, - req->priv_data, rwxp, result); - } /* We get here in the error case where the PASID lookup failed, and these can be NULL. Do not use them below this point! 
*/ sdev = NULL; @@ -1087,7 +1074,7 @@ intel_svm_bind(struct device *dev, struct mm_struct *mm, void *drvdata) if (drvdata) flags = *(unsigned int *)drvdata; mutex_lock(_mutex); - ret = intel_svm_bind_mm(dev, flags, NULL, mm, ); + ret = intel_svm_bind_mm(dev, flags, mm, ); if (ret) sva = ERR_PTR(ret); else if (sdev) diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h index 76f974da8ca4..03faf20a6817 100644 --- a/include/linux/intel-iommu.h +++ b/include/linux/intel-iommu.h @@ -770,14 +770,11 @@ u32 intel_svm_get_pasid(struct iommu_sva *handle); int intel_svm_page_response(struct device *dev, struct iommu_fault_event *evt, struct iommu_page_response *msg); -struct svm_dev_ops; - struct intel_svm_dev { struct list_head list; struct rcu_head rcu; struct device *dev; struct intel_iommu *iommu; - struct svm_dev_ops *ops; struct iommu_sva sva; u32 pasid; int users; diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h index 39d368a810b8..6c9d10c0fb1e 100644 --- a/include/linux/intel-svm.h +++ b/include/linux/intel-svm.h @@ -8,13 +8,6 @@ #ifndef __INTEL_SVM_H__ #define __INTEL_SVM_H__ -struct device; - -struct svm_dev_ops { - void (*fault_cb)(struct device *dev, u32 pasid, u64 address, -void *private, int rwxp, int response); -}; - /* Values for rxwp in fault_cb callback */ #define SVM_REQ_READ (1<<3) #define SVM_REQ_WRITE (1<<2) -- 2.25.1
[PATCH 3/5] iommu/vt-d: Remove SVM_FLAG_PRIVATE_PASID
The SVM_FLAG_PRIVATE_PASID has never been referenced in the tree, and there's no plan to have anything to use it. So cleanup it. Signed-off-by: Lu Baolu --- drivers/iommu/intel/svm.c | 40 ++- include/linux/intel-svm.h | 16 +++- 2 files changed, 21 insertions(+), 35 deletions(-) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index 875ee74d8c41..5165cea90421 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -465,9 +465,9 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags, struct mm_struct *mm, struct intel_svm_dev **sd) { struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL); + struct intel_svm *svm = NULL, *t; struct device_domain_info *info; struct intel_svm_dev *sdev; - struct intel_svm *svm = NULL; unsigned long iflags; int pasid_max; int ret; @@ -493,30 +493,26 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags, } } - if (!(flags & SVM_FLAG_PRIVATE_PASID)) { - struct intel_svm *t; - - list_for_each_entry(t, _svm_list, list) { - if (t->mm != mm || (t->flags & SVM_FLAG_PRIVATE_PASID)) - continue; - - svm = t; - if (svm->pasid >= pasid_max) { - dev_warn(dev, -"Limited PASID width. Cannot use existing PASID %d\n", -svm->pasid); - ret = -ENOSPC; - goto out; - } + list_for_each_entry(t, _svm_list, list) { + if (t->mm != mm) + continue; - /* Find the matching device in svm list */ - for_each_svm_dev(sdev, svm, dev) { - sdev->users++; - goto success; - } + svm = t; + if (svm->pasid >= pasid_max) { + dev_warn(dev, +"Limited PASID width. 
Cannot use existing PASID %d\n", +svm->pasid); + ret = -ENOSPC; + goto out; + } - break; + /* Find the matching device in svm list */ + for_each_svm_dev(sdev, svm, dev) { + sdev->users++; + goto success; } + + break; } sdev = kzalloc(sizeof(*sdev), GFP_KERNEL); diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h index 6c9d10c0fb1e..10fa80eef13a 100644 --- a/include/linux/intel-svm.h +++ b/include/linux/intel-svm.h @@ -14,16 +14,6 @@ #define SVM_REQ_EXEC (1<<1) #define SVM_REQ_PRIV (1<<0) -/* - * The SVM_FLAG_PRIVATE_PASID flag requests a PASID which is *not* the "main" - * PASID for the current process. Even if a PASID already exists, a new one - * will be allocated. And the PASID allocated with SVM_FLAG_PRIVATE_PASID - * will not be given to subsequent callers. This facility allows a driver to - * disambiguate between multiple device contexts which access the same MM, - * if there is no other way to do so. It should be used sparingly, if at all. - */ -#define SVM_FLAG_PRIVATE_PASID (1<<0) - /* * The SVM_FLAG_SUPERVISOR_MODE flag requests a PASID which can be used only * for access to kernel addresses. No IOTLB flushes are automatically done @@ -35,18 +25,18 @@ * It is unlikely that we will ever hook into flush_tlb_kernel_range() to * do such IOTLB flushes automatically. */ -#define SVM_FLAG_SUPERVISOR_MODE (1<<1) +#define SVM_FLAG_SUPERVISOR_MODE BIT(0) /* * The SVM_FLAG_GUEST_MODE flag is used when a PASID bind is for guest * processes. Compared to the host bind, the primary differences are: * 1. mm life cycle management * 2. fault reporting */ -#define SVM_FLAG_GUEST_MODE(1<<2) +#define SVM_FLAG_GUEST_MODEBIT(1) /* * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own PASID space, * which requires guest and host PASID translation at both directions. */ -#define SVM_FLAG_GUEST_PASID (1<<3) +#define SVM_FLAG_GUEST_PASID BIT(2) #endif /* __INTEL_SVM_H__ */ -- 2.25.1
[PATCH 0/5] iommu/vt-d: Several misc cleanups
Hi Joerg et al, This series includes several cleanups in the VT-d driver. Please help to review. Best regards, baolu Lu Baolu (5): iommu/vt-d: Remove unused dma map/unmap trace events iommu/vt-d: Remove svm_dev_ops iommu/vt-d: Remove SVM_FLAG_PRIVATE_PASID iommu/vt-d: Remove unused function declarations iommu/vt-d: Make unnecessarily global functions static drivers/iommu/intel/pasid.c| 4 +- drivers/iommu/intel/pasid.h| 5 -- drivers/iommu/intel/svm.c | 55 + include/linux/intel-iommu.h| 3 - include/linux/intel-svm.h | 23 +- include/trace/events/intel_iommu.h | 120 - 6 files changed, 24 insertions(+), 186 deletions(-) -- 2.25.1
[PATCH 1/5] iommu/vt-d: Remove unused dma map/unmap trace events
With commit c588072bba6b5 ("iommu/vt-d: Convert intel iommu driver to the iommu ops"), the trace events for dma map/unmap have no users any more. Cleanup them to make the code neat. Signed-off-by: Lu Baolu --- include/trace/events/intel_iommu.h | 120 - 1 file changed, 120 deletions(-) diff --git a/include/trace/events/intel_iommu.h b/include/trace/events/intel_iommu.h index e801f4910522..d233f2916584 100644 --- a/include/trace/events/intel_iommu.h +++ b/include/trace/events/intel_iommu.h @@ -15,126 +15,6 @@ #include #include -DECLARE_EVENT_CLASS(dma_map, - TP_PROTO(struct device *dev, dma_addr_t dev_addr, phys_addr_t phys_addr, -size_t size), - - TP_ARGS(dev, dev_addr, phys_addr, size), - - TP_STRUCT__entry( - __string(dev_name, dev_name(dev)) - __field(dma_addr_t, dev_addr) - __field(phys_addr_t, phys_addr) - __field(size_t, size) - ), - - TP_fast_assign( - __assign_str(dev_name, dev_name(dev)); - __entry->dev_addr = dev_addr; - __entry->phys_addr = phys_addr; - __entry->size = size; - ), - - TP_printk("dev=%s dev_addr=0x%llx phys_addr=0x%llx size=%zu", - __get_str(dev_name), - (unsigned long long)__entry->dev_addr, - (unsigned long long)__entry->phys_addr, - __entry->size) -); - -DEFINE_EVENT(dma_map, map_single, - TP_PROTO(struct device *dev, dma_addr_t dev_addr, phys_addr_t phys_addr, -size_t size), - TP_ARGS(dev, dev_addr, phys_addr, size) -); - -DEFINE_EVENT(dma_map, bounce_map_single, - TP_PROTO(struct device *dev, dma_addr_t dev_addr, phys_addr_t phys_addr, -size_t size), - TP_ARGS(dev, dev_addr, phys_addr, size) -); - -DECLARE_EVENT_CLASS(dma_unmap, - TP_PROTO(struct device *dev, dma_addr_t dev_addr, size_t size), - - TP_ARGS(dev, dev_addr, size), - - TP_STRUCT__entry( - __string(dev_name, dev_name(dev)) - __field(dma_addr_t, dev_addr) - __field(size_t, size) - ), - - TP_fast_assign( - __assign_str(dev_name, dev_name(dev)); - __entry->dev_addr = dev_addr; - __entry->size = size; - ), - - TP_printk("dev=%s dev_addr=0x%llx size=%zu", - __get_str(dev_name), - 
(unsigned long long)__entry->dev_addr, - __entry->size) -); - -DEFINE_EVENT(dma_unmap, unmap_single, - TP_PROTO(struct device *dev, dma_addr_t dev_addr, size_t size), - TP_ARGS(dev, dev_addr, size) -); - -DEFINE_EVENT(dma_unmap, unmap_sg, - TP_PROTO(struct device *dev, dma_addr_t dev_addr, size_t size), - TP_ARGS(dev, dev_addr, size) -); - -DEFINE_EVENT(dma_unmap, bounce_unmap_single, - TP_PROTO(struct device *dev, dma_addr_t dev_addr, size_t size), - TP_ARGS(dev, dev_addr, size) -); - -DECLARE_EVENT_CLASS(dma_map_sg, - TP_PROTO(struct device *dev, int index, int total, -struct scatterlist *sg), - - TP_ARGS(dev, index, total, sg), - - TP_STRUCT__entry( - __string(dev_name, dev_name(dev)) - __field(dma_addr_t, dev_addr) - __field(phys_addr_t, phys_addr) - __field(size_t, size) - __field(int, index) - __field(int, total) - ), - - TP_fast_assign( - __assign_str(dev_name, dev_name(dev)); - __entry->dev_addr = sg->dma_address; - __entry->phys_addr = sg_phys(sg); - __entry->size = sg->dma_length; - __entry->index = index; - __entry->total = total; - ), - - TP_printk("dev=%s [%d/%d] dev_addr=0x%llx phys_addr=0x%llx size=%zu", - __get_str(dev_name), __entry->index, __entry->total, - (unsigned long long)__entry->dev_addr, - (unsigned long long)__entry->phys_addr, - __entry->size) -); - -DEFINE_EVENT(dma_map_sg, map_sg, - TP_PROTO(struct device *dev, int index, int total, -struct scatterlist *sg), - TP_ARGS(dev, index, total, sg) -); - -DEFINE_EVENT(dma_map_sg, bounce_map_sg, - TP_PROTO(struct device *dev, int index, int total, -struct scatterlist *sg), - TP_ARGS(dev, index, total, sg) -); - TRACE_EVENT(qi_submit, TP_PROTO(struct intel_iommu *iommu, u64 qw0, u64 qw1, u64 qw2, u64 qw3), -- 2.25.1
[PATCH 4/5] iommu/vt-d: Remove unused function declarations
Some functions have been deprecated. Remove the remaining function declarations.

Signed-off-by: Lu Baolu
---
 drivers/iommu/intel/pasid.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index 444c0bec221a..90a3268d7a77 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -99,9 +99,6 @@ static inline bool pasid_pte_is_present(struct pasid_entry *pte)
 }
 
 extern unsigned int intel_pasid_max_id;
-int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
-void intel_pasid_free_id(u32 pasid);
-void *intel_pasid_lookup_id(u32 pasid);
 int intel_pasid_alloc_table(struct device *dev);
 void intel_pasid_free_table(struct device *dev);
 struct pasid_table *intel_pasid_get_table(struct device *dev);
-- 
2.25.1
[PATCH v2 5/5] iommu/vt-d: Avoid unnecessary cache flush in pasid entry teardown
When a present pasid entry is disassembled, all kinds of pasid related caches need to be flushed. But when a pasid entry is not being used (PRESENT bit not set), we don't need to do this. Check the PRESENT bit in intel_pasid_tear_down_entry() and avoid flushing caches if it's not set.

Signed-off-by: Lu Baolu
---
 drivers/iommu/intel/pasid.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index dd69df5a188a..7a73385edcc0 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -502,6 +502,9 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
 	if (WARN_ON(!pte))
 		return;
 
+	if (!(pte->val[0] & PASID_PTE_PRESENT))
+		return;
+
 	did = pasid_get_domain_id(pte);
 	intel_pasid_clear_entry(dev, pasid, fault_ignore);
-- 
2.25.1
[PATCH v2 3/5] iommu/vt-d: Invalidate PASID cache when root/context entry changed
When the Intel IOMMU is operating in the scalable mode, some information from the root and context table may be used to tag entries in the PASID cache. Software should invalidate the PASID-cache when changing root or context table entries. Suggested-by: Ashok Raj Fixes: 7373a8cc38197 ("iommu/vt-d: Setup context and enable RID2PASID support") Signed-off-by: Lu Baolu --- drivers/iommu/intel/iommu.c | 18 +- include/linux/intel-iommu.h | 1 + 2 files changed, 10 insertions(+), 9 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 132cbf9f214f..868f195f55ff 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -1339,6 +1339,11 @@ static void iommu_set_root_entry(struct intel_iommu *iommu) readl, (sts & DMA_GSTS_RTPS), sts); raw_spin_unlock_irqrestore(>register_lock, flag); + + iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL); + if (sm_supported(iommu)) + qi_flush_pasid_cache(iommu, 0, QI_PC_GLOBAL, 0); + iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH); } void iommu_flush_write_buffer(struct intel_iommu *iommu) @@ -2422,6 +2427,10 @@ static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn (((u16)bus) << 8) | devfn, DMA_CCMD_MASK_NOBIT, DMA_CCMD_DEVICE_INVL); + + if (sm_supported(iommu)) + qi_flush_pasid_cache(iommu, did_old, QI_PC_ALL_PASIDS, 0); + iommu->flush.flush_iotlb(iommu, did_old, 0, @@ -3267,8 +3276,6 @@ static int __init init_dmars(void) register_pasid_allocator(iommu); #endif iommu_set_root_entry(iommu); - iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL); - iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH); } #ifdef CONFIG_INTEL_IOMMU_BROKEN_GFX_WA @@ -3458,12 +3465,7 @@ static int init_iommu_hw(void) } iommu_flush_write_buffer(iommu); - iommu_set_root_entry(iommu); - - iommu->flush.flush_context(iommu, 0, 0, 0, - DMA_CCMD_GLOBAL_INVL); - iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH); 
iommu_enable_translation(iommu); iommu_disable_protect_mem_regions(iommu); } @@ -3846,8 +3848,6 @@ static int intel_iommu_add(struct dmar_drhd_unit *dmaru) goto disable_iommu; iommu_set_root_entry(iommu); - iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL); - iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH); iommu_enable_translation(iommu); iommu_disable_protect_mem_regions(iommu); diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h index 1732298ce888..76f974da8ca4 100644 --- a/include/linux/intel-iommu.h +++ b/include/linux/intel-iommu.h @@ -378,6 +378,7 @@ enum { /* PASID cache invalidation granu */ #define QI_PC_ALL_PASIDS 0 #define QI_PC_PASID_SEL1 +#define QI_PC_GLOBAL 3 #define QI_EIOTLB_ADDR(addr) ((u64)(addr) & VTD_PAGE_MASK) #define QI_EIOTLB_IH(ih) (((u64)ih) << 6) -- 2.25.1
[PATCH v2 4/5] iommu/vt-d: Use user privilege for RID2PASID translation
When first-level page tables are used for IOVA translation, we use user privilege by setting U/S bit in the page table entry. This is to make it consistent with the second level translation, where the U/S enforcement is not available. Clear the SRE (Supervisor Request Enable) field in the pasid table entry of RID2PASID so that requests requesting the supervisor privilege are blocked and treated as DMA remapping faults. Suggested-by: Jacob Pan Fixes: b802d070a52a1 ("iommu/vt-d: Use iova over first level") Signed-off-by: Lu Baolu --- drivers/iommu/intel/iommu.c | 7 +-- drivers/iommu/intel/pasid.c | 3 ++- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 868f195f55ff..7354f9ce47d8 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -2494,9 +2494,9 @@ static int domain_setup_first_level(struct intel_iommu *iommu, struct device *dev, u32 pasid) { - int flags = PASID_FLAG_SUPERVISOR_MODE; struct dma_pte *pgd = domain->pgd; int agaw, level; + int flags = 0; /* * Skip top levels of page tables for iommu which has @@ -2512,7 +2512,10 @@ static int domain_setup_first_level(struct intel_iommu *iommu, if (level != 4 && level != 5) return -EINVAL; - flags |= (level == 5) ? PASID_FLAG_FL5LP : 0; + if (pasid != PASID_RID2PASID) + flags |= PASID_FLAG_SUPERVISOR_MODE; + if (level == 5) + flags |= PASID_FLAG_FL5LP; return intel_pasid_setup_first_level(iommu, dev, (pgd_t *)pgd, pasid, domain->iommu_did[iommu->seq_id], diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 0bf7e0a76890..dd69df5a188a 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -673,7 +673,8 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu, * Since it is a second level only translation setup, we should * set SRE bit as well (addresses are expected to be GPAs). 
*/ - pasid_set_sre(pte); + if (pasid != PASID_RID2PASID) + pasid_set_sre(pte); pasid_set_present(pte); pasid_flush_caches(iommu, pte, pasid, did); -- 2.25.1
[PATCH v2 0/5] iommu/vt-d: Several misc fixes
Hi Joerg, This series includes some misc fixes for the VT-d iommu driver. Please help to review and merge. Best regards, baolu Change log: v1->v2: - v1: https://lore.kernel.org/linux-iommu/20210225062654.2864322-1-baolu...@linux.intel.com/ - [PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries - Refine the commit message to make the intention clear. Lu Baolu (5): iommu/vt-d: Report the right page fault address iommu/vt-d: Remove WO permissions on second-level paging entries iommu/vt-d: Invalidate PASID cache when root/context entry changed iommu/vt-d: Use user privilege for RID2PASID translation iommu/vt-d: Avoid unnecessary cache flush in pasid entry teardown drivers/iommu/intel/iommu.c | 28 drivers/iommu/intel/pasid.c | 6 +- drivers/iommu/intel/svm.c | 2 +- include/linux/intel-iommu.h | 1 + 4 files changed, 23 insertions(+), 14 deletions(-) -- 2.25.1
[PATCH v2 1/5] iommu/vt-d: Report the right page fault address
The Address field of the Page Request Descriptor only keeps bits [63:12] of the offending address. Convert it to a full address before reporting it to device drivers.

Fixes: eb8d93ea3c1d3 ("iommu/vt-d: Report page request faults for guest SVA")
Signed-off-by: Lu Baolu
---
 drivers/iommu/intel/svm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 54db58945c2d..677d7f6b43bb 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -862,7 +862,7 @@ intel_svm_prq_report(struct device *dev, struct page_req_dsc *desc)
 	/* Fill in event data for device specific processing */
 	memset(&event, 0, sizeof(struct iommu_fault_event));
 	event.fault.type = IOMMU_FAULT_PAGE_REQ;
-	event.fault.prm.addr = desc->addr;
+	event.fault.prm.addr = (u64)desc->addr << VTD_PAGE_SHIFT;
 	event.fault.prm.pasid = desc->pasid;
 	event.fault.prm.grpid = desc->prg_index;
 	event.fault.prm.perm = prq_to_iommu_prot(desc);
-- 
2.25.1
[PATCH v2 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries
When the first level page table is used for IOVA translation, it only supports Read-Only and Read-Write permissions. The Write-Only permission is not supported as the PRESENT bit (implying Read permission) should always be set. When using the second level, we still grant separate permissions that allow Write-Only, which seems inconsistent and awkward. We want consistent behavior: after moving to first level, we don't want mappings that work there but break when second level is used for the same mappings. Hence remove this configuration.

Suggested-by: Ashok Raj
Fixes: b802d070a52a1 ("iommu/vt-d: Use iova over first level")
Signed-off-by: Lu Baolu
---
 drivers/iommu/intel/iommu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 167219ea8d70..132cbf9f214f 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2304,8 +2304,9 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 		return -EINVAL;
 
 	attr = prot & (DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP);
+	attr |= DMA_FL_PTE_PRESENT;
 	if (domain_use_first_level(domain)) {
-		attr |= DMA_FL_PTE_PRESENT | DMA_FL_PTE_XD | DMA_FL_PTE_US;
+		attr |= DMA_FL_PTE_XD | DMA_FL_PTE_US;
 
 		if (domain->domain.type == IOMMU_DOMAIN_DMA) {
 			attr |= DMA_FL_PTE_ACCESS;
-- 
2.25.1
[PATCH v2 1/1] iommu/vt-d: Don't set then clear private data in prq_event_thread()
The VT-d specification (section 7.6) requires that the value in the Private Data field of a Page Group Response Descriptor must match the value in the Private Data field of the respective Page Request Descriptor. The private data field of a page group response descriptor is set then immediately cleared in prq_event_thread(). This breaks the rule defined by the VT-d specification. Fix it by moving the clearing code up.

Fixes: 5b438f4ba315d ("iommu/vt-d: Support page request in scalable mode")
Cc: Jacob Pan
Reviewed-by: Liu Yi L
Signed-off-by: Lu Baolu
---
 drivers/iommu/intel/svm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Log:
v1->v2:
 - v1: https://lore.kernel.org/linux-iommu/20210309004641.3809653-1-baolu...@linux.intel.com/
 - Refine the commit title to make the affected field clear.
 - Refine the commit message to declare why the change matters.

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 677d7f6b43bb..5d590d63ab52 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -1034,12 +1034,12 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 				QI_PGRP_RESP_TYPE;
 			resp.qw1 = QI_PGRP_IDX(req->prg_index) |
 				QI_PGRP_LPIG(req->lpig);
+			resp.qw2 = 0;
+			resp.qw3 = 0;
 
 			if (req->priv_data_present)
 				memcpy(&resp.qw2, req->priv_data,
 				       sizeof(req->priv_data));
-			resp.qw2 = 0;
-			resp.qw3 = 0;
 			qi_submit_sync(iommu, &resp, 1, 0);
 		}
 prq_advance:
-- 
2.25.1
[PATCH v2 1/1] iommu/vt-d: Fix lockdep splat in intel_pasid_get_entry()
The pasid_lock is used to synchronize different threads from modifying a same pasid directory entry at the same time. It causes below lockdep splat. [ 83.296538] [ 83.296538] WARNING: possible irq lock inversion dependency detected [ 83.296539] 5.12.0-rc3+ #25 Tainted: GW [ 83.296539] [ 83.296540] bash/780 just changed the state of lock: [ 83.296540] 82b29c98 (device_domain_lock){..-.}-{2:2}, at: iommu_flush_dev_iotlb.part.0+0x32/0x110 [ 83.296547] but this lock took another, SOFTIRQ-unsafe lock in the past: [ 83.296547] (pasid_lock){+.+.}-{2:2} [ 83.296548] and interrupts could create inverse lock ordering between them. [ 83.296549] other info that might help us debug this: [ 83.296549] Chain exists of: device_domain_lock --> >lock --> pasid_lock [ 83.296551] Possible interrupt unsafe locking scenario: [ 83.296551]CPU0CPU1 [ 83.296552] [ 83.296552] lock(pasid_lock); [ 83.296553]local_irq_disable(); [ 83.296553]lock(device_domain_lock); [ 83.296554]lock(>lock); [ 83.296554] [ 83.296554] lock(device_domain_lock); [ 83.296555] *** DEADLOCK *** Fix it by replacing the pasid_lock with an atomic exchange operation. Reported-and-tested-by: Dave Jiang Signed-off-by: Lu Baolu --- drivers/iommu/intel/pasid.c | 21 + 1 file changed, 13 insertions(+), 8 deletions(-) Log: v1->v2: - v1: https://lore.kernel.org/linux-iommu/20210317005834.173503-1-baolu...@linux.intel.com/ - Use retry to make code neat; - Add a comment about no clear case, hence no race. 
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 7a73385edcc0..f2c747e62c6a 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -24,7 +24,6 @@
 /*
  * Intel IOMMU system wide PASID name space:
  */
-static DEFINE_SPINLOCK(pasid_lock);
 u32 intel_pasid_max_id = PASID_MAX;
 
 int vcmd_alloc_pasid(struct intel_iommu *iommu, u32 *pasid)
@@ -259,19 +258,25 @@ struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid)
 	dir_index = pasid >> PASID_PDE_SHIFT;
 	index = pasid & PASID_PTE_MASK;
 
-	spin_lock(&pasid_lock);
+retry:
 	entries = get_pasid_table_from_pde(&dir[dir_index]);
 	if (!entries) {
 		entries = alloc_pgtable_page(info->iommu->node);
-		if (!entries) {
-			spin_unlock(&pasid_lock);
+		if (!entries)
 			return NULL;
-		}
 
-		WRITE_ONCE(dir[dir_index].val,
-			   (u64)virt_to_phys(entries) | PASID_PTE_PRESENT);
+		/*
+		 * The pasid directory table entry won't be freed after
+		 * allocation. No worry about the race with free and
+		 * clear. However, this entry might be populated by others
+		 * while we are preparing it. Use theirs with a retry.
+		 */
+		if (cmpxchg64(&dir[dir_index].val, 0ULL,
+			      (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) {
+			free_pgtable_page(entries);
+			goto retry;
+		}
 	}
-	spin_unlock(&pasid_lock);
 
 	return &entries[index];
}
-- 
2.25.1
Re: [RFC PATCH v1 0/4] vfio: Add IOPF support for VFIO passthrough
On 3/19/21 9:30 AM, Keqian Zhu wrote: Hi Baolu, On 2021/3/19 8:33, Lu Baolu wrote: On 3/18/21 7:53 PM, Shenming Lu wrote: On 2021/3/18 17:07, Tian, Kevin wrote: From: Shenming Lu Sent: Thursday, March 18, 2021 3:53 PM On 2021/2/4 14:52, Tian, Kevin wrote:>>> In reality, many devices allow I/O faulting only in selective contexts. However, there is no standard way (e.g. PCISIG) for the device to report whether arbitrary I/O fault is allowed. Then we may have to maintain device specific knowledge in software, e.g. in an opt-in table to list devices which allows arbitrary faults. For devices which only support selective faulting, a mediator (either through vendor extensions on vfio-pci-core or a mdev wrapper) might be necessary to help lock down non-faultable mappings and then enable faulting on the rest mappings. For devices which only support selective faulting, they could tell it to the IOMMU driver and let it filter out non-faultable faults? Do I get it wrong? Not exactly to IOMMU driver. There is already a vfio_pin_pages() for selectively page-pinning. The matter is that 'they' imply some device specific logic to decide which pages must be pinned and such knowledge is outside of VFIO. From enabling p.o.v we could possibly do it in phased approach. First handles devices which tolerate arbitrary DMA faults, and then extends to devices with selective-faulting. The former is simpler, but with one main open whether we want to maintain such device IDs in a static table in VFIO or rely on some hints from other components (e.g. PF driver in VF assignment case). Let's see how Alex thinks about it. Hi Kevin, You mentioned selective-faulting some time ago. I still have some doubt about it: There is already a vfio_pin_pages() which is used for limiting the IOMMU group dirty scope to pinned pages, could it also be used for indicating the faultable scope is limited to the pinned pages and the rest mappings is non-faultable that should be pinned and mapped immediately? 
But it seems to be a little weird and not exactly to what you meant... I will be grateful if you can help to explain further.:-) The opposite, i.e. the vendor driver uses vfio_pin_pages to lock down pages that are not faultable (based on its specific knowledge) and then the rest memory becomes faultable. Ahh... Thus, from the perspective of VFIO IOMMU, if IOPF enabled for such device, only the page faults within the pinned range are valid in the registered iommu fault handler... Isn't it opposite? The pinned pages will never generate any page faults. I might miss some contexts here. It seems that vfio_pin_pages() just pin some pages and record the pinned scope to pfn_list of vfio_dma. No mapping is established, so we still has page faults. Make sense. Thanks a lot for the explanation. IIUC, vfio_pin_pages() is used to 1. pin pages for non-iommu backed devices. 2. mark dirty scope for non-iommu backed devices and iommu backed devices. Best regards, baolu
Re: [PATCH 1/1] iommu/vt-d: Don't set then immediately clear in prq_event_thread()
Hi Joerg, On 3/18/21 6:10 PM, Joerg Roedel wrote: Hi Baolu, On Tue, Mar 09, 2021 at 08:46:41AM +0800, Lu Baolu wrote: The private data field of a page group response descriptor is set then immediately cleared in prq_event_thread(). Fix this by moving clearing code up. Fixes: 5b438f4ba315d ("iommu/vt-d: Support page request in scalable mode") Cc: Jacob Pan Reviewed-by: Liu Yi L Signed-off-by: Lu Baolu Does this fix an actual bug? If so, please state it in the commit It will cause real problem according to the VT-d spec. I haven't got a chance run this on a real hardware yet. I'll add a commit message to explain why this will cause problem. message and also fix the subject line to state what is set/cleared. Sure! Thanks, Joerg Best regards, baolu
Re: [PATCH 1/1] iommu/vt-d: Fix lockdep splat in intel_pasid_get_entry()
Hi Joerg, On 3/18/21 6:21 PM, Joerg Roedel wrote: On Wed, Mar 17, 2021 at 08:58:34AM +0800, Lu Baolu wrote: The pasid_lock is used to synchronize different threads from modifying a same pasid directory entry at the same time. It causes below lockdep splat. [ 83.296538] [ 83.296538] WARNING: possible irq lock inversion dependency detected [ 83.296539] 5.12.0-rc3+ #25 Tainted: GW [ 83.296539] [ 83.296540] bash/780 just changed the state of lock: [ 83.296540] 82b29c98 (device_domain_lock){..-.}-{2:2}, at: iommu_flush_dev_iotlb.part.0+0x32/0x110 [ 83.296547] but this lock took another, SOFTIRQ-unsafe lock in the past: [ 83.296547] (pasid_lock){+.+.}-{2:2} [ 83.296548] and interrupts could create inverse lock ordering between them. [ 83.296549] other info that might help us debug this: [ 83.296549] Chain exists of: device_domain_lock --> >lock --> pasid_lock [ 83.296551] Possible interrupt unsafe locking scenario: [ 83.296551]CPU0CPU1 [ 83.296552] [ 83.296552] lock(pasid_lock); [ 83.296553]local_irq_disable(); [ 83.296553]lock(device_domain_lock); [ 83.296554]lock(>lock); [ 83.296554] [ 83.296554] lock(device_domain_lock); [ 83.296555] *** DEADLOCK *** Fix it by replacing the pasid_lock with an atomic exchange operation. 
Reported-and-tested-by: Dave Jiang Signed-off-by: Lu Baolu --- drivers/iommu/intel/pasid.c | 14 ++ 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 9fb3d3e80408..1ddcb8295f72 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -24,7 +24,6 @@ /* * Intel IOMMU system wide PASID name space: */ -static DEFINE_SPINLOCK(pasid_lock); u32 intel_pasid_max_id = PASID_MAX; int vcmd_alloc_pasid(struct intel_iommu *iommu, u32 *pasid) @@ -259,19 +258,18 @@ struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid) dir_index = pasid >> PASID_PDE_SHIFT; index = pasid & PASID_PTE_MASK; - spin_lock(_lock); entries = get_pasid_table_from_pde([dir_index]); if (!entries) { entries = alloc_pgtable_page(info->iommu->node); - if (!entries) { - spin_unlock(_lock); + if (!entries) return NULL; - } - WRITE_ONCE(dir[dir_index].val, - (u64)virt_to_phys(entries) | PASID_PTE_PRESENT); + if (cmpxchg64([dir_index].val, 0ULL, + (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) { + free_pgtable_page(entries); + entries = get_pasid_table_from_pde([dir_index]); This is racy, someone could have already cleared the pasid-entry again. This code modifies the pasid directory entry. The pasid directory entries are allocated on demand and will never be freed. What you need to do here is to retry the whole path by adding a goto to before the first get_pasid_table_from_pde() call. Yes. Retrying by adding a goto makes the code clearer. Btw, what makes sure that the pasid_entry does not go away when it is returned here? As explained above, it handles the pasid directory table entry which won't go away. Regards, Joerg Best regards, baolu
Re: [RFC PATCH v1 0/4] vfio: Add IOPF support for VFIO passthrough
On 3/18/21 7:53 PM, Shenming Lu wrote: On 2021/3/18 17:07, Tian, Kevin wrote: From: Shenming Lu Sent: Thursday, March 18, 2021 3:53 PM On 2021/2/4 14:52, Tian, Kevin wrote:>>> In reality, many devices allow I/O faulting only in selective contexts. However, there is no standard way (e.g. PCISIG) for the device to report whether arbitrary I/O fault is allowed. Then we may have to maintain device specific knowledge in software, e.g. in an opt-in table to list devices which allows arbitrary faults. For devices which only support selective faulting, a mediator (either through vendor extensions on vfio-pci-core or a mdev wrapper) might be necessary to help lock down non-faultable mappings and then enable faulting on the rest mappings. For devices which only support selective faulting, they could tell it to the IOMMU driver and let it filter out non-faultable faults? Do I get it wrong? Not exactly to IOMMU driver. There is already a vfio_pin_pages() for selectively page-pinning. The matter is that 'they' imply some device specific logic to decide which pages must be pinned and such knowledge is outside of VFIO. From enabling p.o.v we could possibly do it in phased approach. First handles devices which tolerate arbitrary DMA faults, and then extends to devices with selective-faulting. The former is simpler, but with one main open whether we want to maintain such device IDs in a static table in VFIO or rely on some hints from other components (e.g. PF driver in VF assignment case). Let's see how Alex thinks about it. Hi Kevin, You mentioned selective-faulting some time ago. I still have some doubt about it: There is already a vfio_pin_pages() which is used for limiting the IOMMU group dirty scope to pinned pages, could it also be used for indicating the faultable scope is limited to the pinned pages and the rest mappings is non-faultable that should be pinned and mapped immediately? But it seems to be a little weird and not exactly to what you meant... 
I will be grateful if you can help to explain further. :-) The opposite, i.e. the vendor driver uses vfio_pin_pages to lock down pages that are not faultable (based on its specific knowledge) and then the rest memory becomes faultable. Ahh... Thus, from the perspective of VFIO IOMMU, if IOPF enabled for such device, only the page faults within the pinned range are valid in the registered iommu fault handler... Isn't it opposite? The pinned pages will never generate any page faults. I might miss some contexts here. I have another question here, for the IOMMU backed devices, they are already all pinned and mapped when attaching, is there a need to call vfio_pin_pages() to lock down pages for them? Did I miss something?... Best regards, baolu
Re: [PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries
Hi Joerg, On 3/18/21 5:12 PM, Joerg Roedel wrote: Hi, On Mon, Mar 08, 2021 at 11:47:46AM -0800, Raj, Ashok wrote: That is the primary motivation, given that we have moved to 1st level for general IOVA, first level doesn't have a WO mapping. I didn't know enough about the history to determine if a WO without a READ is very useful. I guess the ZLR was invented to support those cases without a READ in PCIe. I Okay, please update the commit message and re-send. I guess these patches are 5.13 stuff. In that case, Baolu can include them into his pull request later this cycle. Okay! It works for me. Best regards, baolu
Re: A problem of Intel IOMMU hardware ?
On 3/18/21 4:56 PM, Tian, Kevin wrote: From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) -Original Message- From: Tian, Kevin [mailto:kevin.t...@intel.com] Sent: Thursday, March 18, 2021 4:27 PM To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) ; Nadav Amit Cc: chenjiashang ; David Woodhouse ; io...@lists.linux-foundation.org; LKML ; alex.william...@redhat.com; Gonglei (Arei) ; w...@kernel.org Subject: RE: A problem of Intel IOMMU hardware ? From: iommu On Behalf Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.) 2. Consider ensuring that the problem is not somehow related to queued invalidations. Try to use __iommu_flush_iotlb() instead of qi_flush_iotlb(). I tried to force to use __iommu_flush_iotlb(), but maybe something wrong, the system crashed, so I prefer to lower the priority of this operation. The VT-d spec clearly says that register-based invalidation can be used only when queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide an option to disable queued-invalidation though, when the hardware is capable. If you really want to try, tweak the code in intel_iommu_init_qi. Hi Kevin, Thanks to point out this. Do you have any ideas about this problem ? I tried to descript the problem much clear in my reply to Alex, hope you could have a look if you're interested. btw I saw you used 4.18 kernel in this test. What about latest kernel? Also one way to separate sw/hw bug is to trace the low level interface (e.g., qi_flush_iotlb) which actually sends invalidation descriptors to the IOMMU hardware. Check the window between b) and c) and see whether the software does the right thing as expected there. Yes. It's better if we can reproduce this with the latest kernel which has debugfs files to expose page tables and the invalidation queues etc. Best regards, baolu
Re: A problem of Intel IOMMU hardware ?
Hi Nadav, On 3/18/21 2:12 AM, Nadav Amit wrote: On Mar 17, 2021, at 2:35 AM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: Hi Nadav, -Original Message- From: Nadav Amit [mailto:nadav.a...@gmail.com] reproduce the problem with high probability (~50%). I saw Lu replied, and he is much more knowledgable than I am (I was just intrigued by your email). However, if I were you I would try also to remove some “optimizations” to look for the root-cause (e.g., use domain specific invalidations instead of page-specific). Good suggestion! But we did it these days, we tried to use global invalidations as follow: iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH); But can not resolve the problem. The first thing that comes to my mind is the invalidation hint (ih) in iommu_flush_iotlb_psi(). I would remove it to see whether you get the failure without it. We also notice the IH, but the IH is always ZERO in our case, as the spec says: ''' Paging-structure-cache entries caching second-level mappings associated with the specified domain-id and the second-level-input-address range are invalidated, if the Invalidation Hint (IH) field is Clear. ''' It seems the software is everything fine, so we've no choice but to suspect the hardware. Ok, I am pretty much out of ideas. I have two more suggestions, but they are much less likely to help. Yet, they can further help to rule out software bugs: 1. dma_clear_pte() seems to be wrong IMHO. It should have used WRITE_ONCE() to prevent split-write, which might potentially cause “invalid” (partially cleared) PTE to be stored in the TLB. Having said that, the subsequent IOTLB flush should have prevented the problem. Agreed. The pte read/write should use READ/WRITE_ONCE() instead. 2. Consider ensuring that the problem is not somehow related to queued invalidations. Try to use __iommu_flush_iotlb() instead of qi_flush_iotlb(). Regards, Nadav Best regards, baolu
Re: A problem of Intel IOMMU hardware ?
Hi Alex, On 3/17/21 11:18 PM, Alex Williamson wrote: {MAP, 0x0, 0xc000}, - (b) use GDB to pause at here, and then DMA read IOVA=0, IOVA 0 seems to be a special one. Have you verified with other addresses than IOVA 0? It is??? That would be a problem. No problem from hardware point of view as far as I can see. Just thought about software might handle it specially. Best regards, baolu
[PATCH 1/1] iommu/vt-d: Report more information about invalidation errors
When the invalidation queue errors are encountered, dump the information logged by the VT-d hardware together with the pending queue invalidation descriptors. Signed-off-by: Ashok Raj Tested-by: Guo Kaijie Signed-off-by: Lu Baolu --- drivers/iommu/intel/dmar.c | 68 ++--- include/linux/intel-iommu.h | 6 2 files changed, 70 insertions(+), 4 deletions(-) diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c index d5c51b5c20af..6971397805f3 100644 --- a/drivers/iommu/intel/dmar.c +++ b/drivers/iommu/intel/dmar.c @@ -1205,6 +1205,63 @@ static inline void reclaim_free_desc(struct q_inval *qi) } } +static const char *qi_type_string(u8 type) +{ + switch (type) { + case QI_CC_TYPE: + return "Context-cache Invalidation"; + case QI_IOTLB_TYPE: + return "IOTLB Invalidation"; + case QI_DIOTLB_TYPE: + return "Device-TLB Invalidation"; + case QI_IEC_TYPE: + return "Interrupt Entry Cache Invalidation"; + case QI_IWD_TYPE: + return "Invalidation Wait"; + case QI_EIOTLB_TYPE: + return "PASID-based IOTLB Invalidation"; + case QI_PC_TYPE: + return "PASID-cache Invalidation"; + case QI_DEIOTLB_TYPE: + return "PASID-based Device-TLB Invalidation"; + case QI_PGRP_RESP_TYPE: + return "Page Group Response"; + default: + return "UNKNOWN"; + } +} + +static void qi_dump_fault(struct intel_iommu *iommu, u32 fault) +{ + unsigned int head = dmar_readl(iommu->reg + DMAR_IQH_REG); + u64 iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG); + struct qi_desc *desc = iommu->qi->desc + head; + + if (fault & DMA_FSTS_IQE) + pr_err("VT-d detected Invalidation Queue Error: Reason %llx", + DMAR_IQER_REG_IQEI(iqe_err)); + if (fault & DMA_FSTS_ITE) + pr_err("VT-d detected Invalidation Time-out Error: SID %llx", + DMAR_IQER_REG_ITESID(iqe_err)); + if (fault & DMA_FSTS_ICE) + pr_err("VT-d detected Invalidation Completion Error: SID %llx", + DMAR_IQER_REG_ICESID(iqe_err)); + + pr_err("QI HEAD: %s qw0 = 0x%llx, qw1 = 0x%llx\n", + qi_type_string(desc->qw0 & 0xf), + (unsigned long long)desc->qw0, + 
(unsigned long long)desc->qw1); + + head = ((head >> qi_shift(iommu)) + QI_LENGTH - 1) % QI_LENGTH; + head <<= qi_shift(iommu); + desc = iommu->qi->desc + head; + + pr_err("QI PRIOR: %s qw0 = 0x%llx, qw1 = 0x%llx\n", + qi_type_string(desc->qw0 & 0xf), + (unsigned long long)desc->qw0, + (unsigned long long)desc->qw1); +} + static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) { u32 fault; @@ -1216,6 +1273,8 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) return -EAGAIN; fault = readl(iommu->reg + DMAR_FSTS_REG); + if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE)) + qi_dump_fault(iommu, fault); /* * If IQE happens, the head points to the descriptor associated @@ -1232,12 +1291,10 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) * used by software as private data. We won't print * out these two qw's for security consideration. */ - pr_err("VT-d detected invalid descriptor: qw0 = %llx, qw1 = %llx\n", - (unsigned long long)desc->qw0, - (unsigned long long)desc->qw1); memcpy(desc, qi->desc + (wait_index << shift), 1 << shift); writel(DMA_FSTS_IQE, iommu->reg + DMAR_FSTS_REG); + pr_info("Invalidation Queue Error (IQE) cleared\n"); return -EINVAL; } } @@ -1254,6 +1311,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH; writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG); + pr_info("Invalidation Time-out Error (ITE) cleared\n"); do { if (qi->desc_status[head] == QI_IN_USE) @@ -1265,8 +1323,10 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) return -EAGA
Re: A problem of Intel IOMMU hardware ?
Hi Longpeng, On 3/17/21 11:16 AM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: Hi guys, We find that the Intel iommu cache (i.e. iotlb) may work incorrectly in a special situation; it can cause DMA failures or return wrong data. The reproducer (based on Alex's vfio testsuite[1]) is in the attachment; it can reproduce the problem with high probability (~50%). The machine we used is: processor : 47 vendor_id : GenuineIntel cpu family : 6 model : 85 model name : Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz stepping : 4 microcode : 0x269 And the iommu capability reported is: ver 1:0 cap 8d2078c106f0466 ecap f020df (caching mode = 0, page-selective invalidation = 1) (The problem also shows up on 'Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz' and 'Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz') We run the reproducer on Linux 4.18 and it works as follows: Step 1. alloc 4G *2M-hugetlb* memory (N.B. no problem with 4K-page mapping) I don't understand what 2M-hugetlb here means exactly. The IOMMU hardware supports both 2M and 1G super pages. The mapped physical memory is 4G. Why couldn't it use a 1G super page? Step 2. DMA Map 4G memory Step 3. while (1) { {UNMAP, 0x0, 0xa}, (a) {UNMAP, 0xc, 0xbff4}, Have these two ranges been mapped before? Does the IOMMU driver complain when you try to unmap a range which has never been mapped? The IOMMU driver implicitly assumes that mapping and unmapping are paired. {MAP, 0x0, 0xc000}, - (b) use GDB to pause here, and then DMA read IOVA=0, IOVA 0 seems to be a special one. Have you verified with addresses other than IOVA 0? sometimes the DMA succeeds (as expected), but sometimes it errors out (reports not-present). {UNMAP, 0x0, 0xc000}, - (c) {MAP, 0x0, 0xa}, {MAP, 0xc, 0xbff4}, } The DMA read operations should succeed between (b) and (c); at the very least they should NOT report not-present! After analyzing the problem, we think it may be caused by the Intel iommu iotlb. It seems the DMA Remapping hardware still uses the IOTLB or other caches of (a).
When the DMA unmap is done at (a), the iotlb is flushed: intel_iommu_unmap domain_unmap iommu_flush_iotlb_psi When the DMA map is done at (b), there is no need to flush the iotlb according to the capability of this iommu: intel_iommu_map domain_pfn_mapping domain_mapping __mapping_notify_one if (cap_caching_mode(iommu->cap)) // FALSE iommu_flush_iotlb_psi That's true. The iotlb flushing is not needed when a PTE is changed from non-present to present, unless in caching mode. But the problem disappears if we FORCE a flush here. So we suspect the iommu hardware. Do you have any suggestions? Best regards, baolu
[PATCH 1/1] iommu/vt-d: Fix lockdep splat in intel_pasid_get_entry()
The pasid_lock is used to synchronize different threads from modifying the same pasid directory entry at the same time. It causes the below lockdep splat. [ 83.296538] [ 83.296538] WARNING: possible irq lock inversion dependency detected [ 83.296539] 5.12.0-rc3+ #25 Tainted: G W [ 83.296539] [ 83.296540] bash/780 just changed the state of lock: [ 83.296540] 82b29c98 (device_domain_lock){..-.}-{2:2}, at: iommu_flush_dev_iotlb.part.0+0x32/0x110 [ 83.296547] but this lock took another, SOFTIRQ-unsafe lock in the past: [ 83.296547] (pasid_lock){+.+.}-{2:2} [ 83.296548] and interrupts could create inverse lock ordering between them. [ 83.296549] other info that might help us debug this: [ 83.296549] Chain exists of: device_domain_lock --> &iommu->lock --> pasid_lock [ 83.296551] Possible interrupt unsafe locking scenario: [ 83.296551] CPU0 CPU1 [ 83.296552] [ 83.296552] lock(pasid_lock); [ 83.296553] local_irq_disable(); [ 83.296553] lock(device_domain_lock); [ 83.296554] lock(&iommu->lock); [ 83.296554] <Interrupt> [ 83.296554] lock(device_domain_lock); [ 83.296555] *** DEADLOCK *** Fix it by replacing the pasid_lock with an atomic exchange operation.
Reported-and-tested-by: Dave Jiang Signed-off-by: Lu Baolu --- drivers/iommu/intel/pasid.c | 14 ++ 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 9fb3d3e80408..1ddcb8295f72 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -24,7 +24,6 @@ /* * Intel IOMMU system wide PASID name space: */ -static DEFINE_SPINLOCK(pasid_lock); u32 intel_pasid_max_id = PASID_MAX; int vcmd_alloc_pasid(struct intel_iommu *iommu, u32 *pasid) @@ -259,19 +258,18 @@ struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid) dir_index = pasid >> PASID_PDE_SHIFT; index = pasid & PASID_PTE_MASK; - spin_lock(&pasid_lock); entries = get_pasid_table_from_pde(&dir[dir_index]); if (!entries) { entries = alloc_pgtable_page(info->iommu->node); - if (!entries) { - spin_unlock(&pasid_lock); + if (!entries) return NULL; - } - WRITE_ONCE(dir[dir_index].val, - (u64)virt_to_phys(entries) | PASID_PTE_PRESENT); + if (cmpxchg64(&dir[dir_index].val, 0ULL, + (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) { + free_pgtable_page(entries); + entries = get_pasid_table_from_pde(&dir[dir_index]); + } } - spin_unlock(&pasid_lock); return &entries[index]; } -- 2.25.1
Re: [PATCH] iommu/vt-d: Disable SVM when ATS/PRI/PASID are not enabled in the device
On 3/15/21 4:15 AM, Kyung Min Park wrote: Currently, the Intel VT-d supports Shared Virtual Memory (SVM) only when IO page fault is supported. Otherwise, shared memory pages can not be swapped out and need to be pinned. The device needs the Address Translation Service (ATS), Page Request Interface (PRI) and Process Address Space Identifier (PASID) capabilities to be enabled to support IO page fault. Disable SVM when ATS, PRI and PASID are not enabled in the device. Signed-off-by: Kyung Min Park --- drivers/iommu/intel/iommu.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index ee0932307d64..956a02eb40b4 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -5380,6 +5380,9 @@ intel_iommu_dev_enable_feat(struct device *dev, enum iommu_dev_features feat) if (!info) return -EINVAL; + if (!info->pasid_enabled || !info->pri_enabled || !info->ats_enabled) + return -EINVAL; + if (info->iommu->flags & VTD_FLAG_SVM_CAPABLE) return 0; } Thanks for the patch. Acked-by: Lu Baolu Best regards, baolu
Re: [RFC PATCH v2 1/6] iommu: Evolve to support more scenarios of using IOPF
Hi Shenming, On 3/9/21 2:22 PM, Shenming Lu wrote: This patch follows the discussion here: https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/ In order to support more scenarios of using IOPF (mainly consider the nested extension), besides keeping IOMMU_DEV_FEAT_IOPF as a general capability for whether delivering faults through the IOMMU, we extend iommu_register_fault_handler() with flags and introduce IOPF_REPORT_FLAT and IOPF_REPORT_NESTED to describe the page fault reporting capability under a specific configuration. IOPF_REPORT_NESTED needs additional info to indicate which level/stage is concerned since the fault client may be interested in only one level. Signed-off-by: Shenming Lu --- .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 3 +- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 ++-- drivers/iommu/io-pgfault.c| 4 -- drivers/iommu/iommu.c | 56 ++- include/linux/iommu.h | 21 ++- include/uapi/linux/iommu.h| 3 + 6 files changed, 85 insertions(+), 13 deletions(-) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c index ee66d1f4cb81..5de9432349d4 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c @@ -482,7 +482,8 @@ static int arm_smmu_master_sva_enable_iopf(struct arm_smmu_master *master) if (ret) return ret; - ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev); + ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, + IOPF_REPORT_FLAT, dev); if (ret) { iopf_queue_remove_device(master->smmu->evtq.iopf, dev); return ret; diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 363744df8d51..f40529d0075d 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -1447,10 +1447,6 @@ static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt) return -EOPNOTSUPP; } - /* Stage-2 is always 
pinned at the moment */ - if (evt[1] & EVTQ_1_S2) - return -EFAULT; - if (evt[1] & EVTQ_1_RnW) perm |= IOMMU_FAULT_PERM_READ; else @@ -1468,13 +1464,18 @@ static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt) .flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE, .grpid = FIELD_GET(EVTQ_1_STAG, evt[1]), .perm = perm, - .addr = FIELD_GET(EVTQ_2_ADDR, evt[2]), }; if (ssid_valid) { flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID; flt->prm.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]); } + + if (evt[1] & EVTQ_1_S2) { + flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_L2; + flt->prm.addr = FIELD_GET(EVTQ_3_IPA, evt[3]); + } else + flt->prm.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]); } else { flt->type = IOMMU_FAULT_DMA_UNRECOV; flt->event = (struct iommu_fault_unrecoverable) { diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c index 1df8c1dcae77..abf16e06bcf5 100644 --- a/drivers/iommu/io-pgfault.c +++ b/drivers/iommu/io-pgfault.c @@ -195,10 +195,6 @@ int iommu_queue_iopf(struct iommu_fault *fault, void *cookie) lockdep_assert_held(&param->lock); - if (fault->type != IOMMU_FAULT_PAGE_REQ) - /* Not a recoverable page fault */ - return -EOPNOTSUPP; - Any reason why you want to remove this check? /* * As long as we're holding param->lock, the queue can't be unlinked * from the device and therefore cannot disappear. diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index d0b0a15dba84..cb1d93b00f7d 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1056,6 +1056,40 @@ int iommu_group_unregister_notifier(struct iommu_group *group, } EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier); +/* + * iommu_update_device_fault_handler - Update the device fault handler via flags + * @dev: the device + * @mask: bits(not set) to clear + * @set: bits to set + * + * Update the device fault handler installed by + * iommu_register_device_fault_handler(). + * + * Return 0 on success, or an error.
+ */ +int iommu_update_device_fault_handler(struct device *dev, u32 mask, u32 set) +{ + struct dev_iommu *param = dev->iommu; + int ret = 0; + + if (!param) + return -EINVAL; + + mutex_lock(&param->lock); + + if (param->fault_param) { + ret = -EINVAL; + goto
[PATCH 1/1] iommu/vt-d: Don't set then immediately clear in prq_event_thread()
The private data field of a page group response descriptor is set and then immediately cleared in prq_event_thread(). Fix this by moving the clearing code up. Fixes: 5b438f4ba315d ("iommu/vt-d: Support page request in scalable mode") Cc: Jacob Pan Reviewed-by: Liu Yi L Signed-off-by: Lu Baolu --- drivers/iommu/intel/svm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index d76cc79f3496..f688c5f29b1a 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -1021,12 +1021,12 @@ static irqreturn_t prq_event_thread(int irq, void *d) QI_PGRP_RESP_TYPE; resp.qw1 = QI_PGRP_IDX(req->prg_index) | QI_PGRP_LPIG(req->lpig); + resp.qw2 = 0; + resp.qw3 = 0; if (req->priv_data_present) memcpy(&resp.qw2, req->priv_data, sizeof(req->priv_data)); - resp.qw2 = 0; - resp.qw3 = 0; qi_submit_sync(iommu, &resp, 1, 0); } prq_advance: -- 2.25.1
Re: [PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries
Hi Joerg, On 3/4/21 8:26 PM, Joerg Roedel wrote: On Thu, Feb 25, 2021 at 02:26:51PM +0800, Lu Baolu wrote: When the first level page table is used for IOVA translation, it only supports Read-Only and Read-Write permissions. The Write-Only permission is not supported as the PRESENT bit (implying Read permission) should always be set. When using second level, we still give separate permissions that allow Write-Only, which seems inconsistent and awkward. There is no use case we can think of, hence remove that configuration to make it consistent. No use-case for WriteOnly mappings? How about DMA_FROM_DEVICE mappings? The statement of no use case is not correct. Sorry about it. As we have moved to use first level for IOVA translation, the first level page table entry only provides RO and RW permissions. So if any device driver specifies the DMA_FROM_DEVICE attribute, it will get RW permission in the page table. This patch aims to make the permissions of second level and first level consistent. There is no impact on the use of the DMA_FROM_DEVICE attribute. Best regards, baolu
Re: [PATCH] iommu/dma: Resurrect the "forcedac" option
devices\n"); - dmar_forcedac = 1; + pr_warn("intel_iommu=forcedac deprecated; use iommu.forcedac instead\n"); + iommu_dma_forcedac = true; } else if (!strncmp(str, "strict", 6)) { pr_info("Disable batched IOTLB flush\n"); intel_iommu_strict = 1; diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h index 706b68d1359b..13d1f4c14d7b 100644 --- a/include/linux/dma-iommu.h +++ b/include/linux/dma-iommu.h @@ -40,6 +40,8 @@ void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list); void iommu_dma_free_cpu_cached_iovas(unsigned int cpu, struct iommu_domain *domain); +extern bool iommu_dma_forcedac; + #else /* CONFIG_IOMMU_DMA */ struct iommu_domain; Thanks for catching this. Acked-by: Lu Baolu Best regards, baolu
Re: [PATCH v2 1/4] iommu/vt-d: Enable write protect for supervisor SVM
On 3/2/21 6:13 PM, Jacob Pan wrote: Write protect bit, when set, inhibits supervisor writes to the read-only pages. In supervisor shared virtual addressing (SVA), where page tables are shared between CPU and DMA, IOMMU PASID entry WPE bit should match CR0.WP bit in the CPU. This patch sets WPE bit for supervisor PASIDs if CR0.WP is set. Signed-off-by: Sanjay Kumar Signed-off-by: Jacob Pan --- drivers/iommu/intel/pasid.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 0cceaabc3ce6..0b7e0e726ade 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -410,6 +410,15 @@ static inline void pasid_set_sre(struct pasid_entry *pe) pasid_set_bits(&pe->val[2], 1 << 0, 1); } +/* + * Setup the WPE(Write Protect Enable) field (Bit 132) of a + * scalable mode PASID entry. + */ +static inline void pasid_set_wpe(struct pasid_entry *pe) +{ + pasid_set_bits(&pe->val[2], 1 << 4, 1 << 4); +} + /* * Setup the P(Present) field (Bit 0) of a scalable mode PASID * entry. @@ -553,6 +562,20 @@ static void pasid_flush_caches(struct intel_iommu *iommu, } } +static inline int pasid_enable_wpe(struct pasid_entry *pte) +{ + unsigned long cr0 = read_cr0(); + + /* CR0.WP is normally set but just to be sure */ + if (unlikely(!(cr0 & X86_CR0_WP))) { + pr_err_ratelimited("No CPU write protect!\n"); + return -EINVAL; + } + pasid_set_wpe(pte); + + return 0; +}; + /* * Set up the scalable mode pasid table entry for first only * translation type. @@ -584,6 +607,9 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu, return -EINVAL; } pasid_set_sre(pte); + if (pasid_enable_wpe(pte)) + return -EINVAL; + } if (flags & PASID_FLAG_FL5LP) { Acked-by: Lu Baolu Best regards, baolu
Re: [PATCH v2 2/4] iommu/vt-d: Enable write protect propagation from guest
On 3/2/21 6:13 PM, Jacob Pan wrote: Write protect bit, when set, inhibits supervisor writes to the read-only pages. In guest supervisor shared virtual addressing (SVA), write-protect should be honored upon guest bind supervisor PASID request. This patch extends the VT-d portion of the IOMMU UAPI to include WP bit. WPE bit of the supervisor PASID entry will be set to match CPU CR0.WP bit. Signed-off-by: Sanjay Kumar Signed-off-by: Jacob Pan --- drivers/iommu/intel/pasid.c | 3 +++ include/uapi/linux/iommu.h | 3 ++- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 0b7e0e726ade..b7e39239f539 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -763,6 +763,9 @@ intel_pasid_setup_bind_data(struct intel_iommu *iommu, struct pasid_entry *pte, return -EINVAL; } pasid_set_sre(pte); + /* Enable write protect WP if guest requested */ + if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_WPE) + pasid_set_wpe(pte); } if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EAFE) { diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h index 35d48843acd8..3a9164cc9937 100644 --- a/include/uapi/linux/iommu.h +++ b/include/uapi/linux/iommu.h @@ -288,7 +288,8 @@ struct iommu_gpasid_bind_data_vtd { #define IOMMU_SVA_VTD_GPASID_PWT (1 << 3) /* page-level write through */ #define IOMMU_SVA_VTD_GPASID_EMTE (1 << 4) /* extended mem type enable */ #define IOMMU_SVA_VTD_GPASID_CD (1 << 5) /* PASID-level cache disable */ -#define IOMMU_SVA_VTD_GPASID_LAST (1 << 6) +#define IOMMU_SVA_VTD_GPASID_WPE (1 << 6) /* Write protect enable */ +#define IOMMU_SVA_VTD_GPASID_LAST (1 << 7) __u64 flags; __u32 pat; __u32 emt; Acked-by: Lu Baolu Best regards, baolu
Re: [PATCH] iommu/vt-d: Fix status code for Allocate/Free PASID command
On 2/27/21 3:39 PM, Zenghui Yu wrote: As per Intel vt-d spec, Rev 3.0 (section 10.4.45 "Virtual Command Response Register"), the status code of "No PASID available" error in response to the Allocate PASID command is 2, not 1. The same for "Invalid PASID" error in response to the Free PASID command. We will otherwise see confusing kernel log under the command failure from guest side. Fix it. Fixes: 24f27d32ab6b ("iommu/vt-d: Enlightened PASID allocation") Signed-off-by: Zenghui Yu --- drivers/iommu/intel/pasid.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h index 97dfcffbf495..444c0bec221a 100644 --- a/drivers/iommu/intel/pasid.h +++ b/drivers/iommu/intel/pasid.h @@ -30,8 +30,8 @@ #define VCMD_VRSP_IP 0x1 #define VCMD_VRSP_SC(e) (((e) >> 1) & 0x3) #define VCMD_VRSP_SC_SUCCESS 0 -#define VCMD_VRSP_SC_NO_PASID_AVAIL1 -#define VCMD_VRSP_SC_INVALID_PASID 1 +#define VCMD_VRSP_SC_NO_PASID_AVAIL2 +#define VCMD_VRSP_SC_INVALID_PASID 2 #define VCMD_VRSP_RESULT_PASID(e) (((e) >> 8) & 0xf) #define VCMD_CMD_OPERAND(e) ((e) << 8) /* Thanks a lot for catching this. Acked-by: Lu Baolu Best regards, baolu
[PATCH 5/5] iommu/vt-d: Avoid unnecessary cache flush in pasid entry teardown
When a present pasid entry is disassembled, all kinds of pasid related caches need to be flushed. But when a pasid entry is not being used (PRESENT bit not set), we don't need to do this. Check the PRESENT bit in intel_pasid_tear_down_entry() and avoid flushing caches if it's not set. Signed-off-by: Lu Baolu --- drivers/iommu/intel/pasid.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 07531e5edfa2..9fb3d3e80408 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -493,6 +493,9 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev, if (WARN_ON(!pte)) return; + if (!(pte->val[0] & PASID_PTE_PRESENT)) + return; + did = pasid_get_domain_id(pte); intel_pasid_clear_entry(dev, pasid, fault_ignore); -- 2.25.1
[PATCH 4/5] iommu/vt-d: Use user privilege for RID2PASID translation
When first-level page tables are used for IOVA translation, we use user privilege by setting U/S bit in the page table entry. This is to make it consistent with the second level translation, where the U/S enforcement is not available. Clear the SRE (Supervisor Request Enable) field in the pasid table entry of RID2PASID so that requests requesting the supervisor privilege are blocked and treated as DMA remapping faults. Suggested-by: Jacob Pan Fixes: b802d070a52a1 ("iommu/vt-d: Use iova over first level") Signed-off-by: Lu Baolu --- drivers/iommu/intel/iommu.c | 7 +-- drivers/iommu/intel/pasid.c | 3 ++- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index f41b184ce6eb..b14427d8121f 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -2495,9 +2495,9 @@ static int domain_setup_first_level(struct intel_iommu *iommu, struct device *dev, u32 pasid) { - int flags = PASID_FLAG_SUPERVISOR_MODE; struct dma_pte *pgd = domain->pgd; int agaw, level; + int flags = 0; /* * Skip top levels of page tables for iommu which has @@ -2513,7 +2513,10 @@ static int domain_setup_first_level(struct intel_iommu *iommu, if (level != 4 && level != 5) return -EINVAL; - flags |= (level == 5) ? PASID_FLAG_FL5LP : 0; + if (pasid != PASID_RID2PASID) + flags |= PASID_FLAG_SUPERVISOR_MODE; + if (level == 5) + flags |= PASID_FLAG_FL5LP; return intel_pasid_setup_first_level(iommu, dev, (pgd_t *)pgd, pasid, domain->iommu_did[iommu->seq_id], diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index f26cb6195b2c..07531e5edfa2 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -647,7 +647,8 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu, * Since it is a second level only translation setup, we should * set SRE bit as well (addresses are expected to be GPAs). 
*/ - pasid_set_sre(pte); + if (pasid != PASID_RID2PASID) + pasid_set_sre(pte); pasid_set_present(pte); pasid_flush_caches(iommu, pte, pasid, did); -- 2.25.1
[PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries
When the first level page table is used for IOVA translation, it only supports Read-Only and Read-Write permissions. The Write-Only permission is not supported as the PRESENT bit (implying Read permission) should always be set. When using second level, we still give separate permissions that allow Write-Only, which seems inconsistent and awkward. There is no use case we can think of, hence remove that configuration to make it consistent. Suggested-by: Ashok Raj Fixes: b802d070a52a1 ("iommu/vt-d: Use iova over first level") Signed-off-by: Lu Baolu --- drivers/iommu/intel/iommu.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index ee0932307d64..19b3fd0d035b 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -2305,8 +2305,9 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn, return -EINVAL; attr = prot & (DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP); + attr |= DMA_FL_PTE_PRESENT; if (domain_use_first_level(domain)) { - attr |= DMA_FL_PTE_PRESENT | DMA_FL_PTE_XD | DMA_FL_PTE_US; + attr |= DMA_FL_PTE_XD | DMA_FL_PTE_US; if (domain->domain.type == IOMMU_DOMAIN_DMA) { attr |= DMA_FL_PTE_ACCESS; -- 2.25.1
[PATCH 1/5] iommu/vt-d: Report the right page fault address
The Address field of the Page Request Descriptor only keeps bit [63:12] of the offending address. Convert it to a full address before reporting it to device drivers. Fixes: eb8d93ea3c1d3 ("iommu/vt-d: Report page request faults for guest SVA") Signed-off-by: Lu Baolu --- drivers/iommu/intel/svm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index 574a7e657a9a..d76cc79f3496 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -862,7 +862,7 @@ intel_svm_prq_report(struct device *dev, struct page_req_dsc *desc) /* Fill in event data for device specific processing */ memset(, 0, sizeof(struct iommu_fault_event)); event.fault.type = IOMMU_FAULT_PAGE_REQ; - event.fault.prm.addr = desc->addr; + event.fault.prm.addr = (u64)desc->addr << VTD_PAGE_SHIFT; event.fault.prm.pasid = desc->pasid; event.fault.prm.grpid = desc->prg_index; event.fault.prm.perm = prq_to_iommu_prot(desc); -- 2.25.1
[PATCH 3/5] iommu/vt-d: Invalidate PASID cache when root/context entry changed
When the Intel IOMMU is operating in the scalable mode, some information from the root and context table may be used to tag entries in the PASID cache. Software should invalidate the PASID-cache when changing root or context table entries. Suggested-by: Ashok Raj Fixes: 7373a8cc38197 ("iommu/vt-d: Setup context and enable RID2PASID support") Signed-off-by: Lu Baolu --- drivers/iommu/intel/iommu.c | 18 +- include/linux/intel-iommu.h | 1 + 2 files changed, 10 insertions(+), 9 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 19b3fd0d035b..f41b184ce6eb 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -1340,6 +1340,11 @@ static void iommu_set_root_entry(struct intel_iommu *iommu) readl, (sts & DMA_GSTS_RTPS), sts); raw_spin_unlock_irqrestore(&iommu->register_lock, flag); + + iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL); + if (sm_supported(iommu)) + qi_flush_pasid_cache(iommu, 0, QI_PC_GLOBAL, 0); + iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH); } void iommu_flush_write_buffer(struct intel_iommu *iommu) @@ -2423,6 +2428,10 @@ static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn (((u16)bus) << 8) | devfn, DMA_CCMD_MASK_NOBIT, DMA_CCMD_DEVICE_INVL); + + if (sm_supported(iommu)) + qi_flush_pasid_cache(iommu, did_old, QI_PC_ALL_PASIDS, 0); + iommu->flush.flush_iotlb(iommu, did_old, 0, 0, DMA_TLB_DSI_FLUSH); @@ -3268,8 +3277,6 @@ static int __init init_dmars(void) register_pasid_allocator(iommu); #endif iommu_set_root_entry(iommu); - iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL); - iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH); } #ifdef CONFIG_INTEL_IOMMU_BROKEN_GFX_WA @@ -3459,12 +3466,7 @@ static int init_iommu_hw(void) } iommu_flush_write_buffer(iommu); - iommu_set_root_entry(iommu); - - iommu->flush.flush_context(iommu, 0, 0, 0, - DMA_CCMD_GLOBAL_INVL); - iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
iommu_enable_translation(iommu); iommu_disable_protect_mem_regions(iommu); } @@ -3847,8 +3849,6 @@ static int intel_iommu_add(struct dmar_drhd_unit *dmaru) goto disable_iommu; iommu_set_root_entry(iommu); - iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL); - iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH); iommu_enable_translation(iommu); iommu_disable_protect_mem_regions(iommu); diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h index 1bc46b88711a..d1f32b33415a 100644 --- a/include/linux/intel-iommu.h +++ b/include/linux/intel-iommu.h @@ -372,6 +372,7 @@ enum { /* PASID cache invalidation granu */ #define QI_PC_ALL_PASIDS 0 #define QI_PC_PASID_SEL1 +#define QI_PC_GLOBAL 3 #define QI_EIOTLB_ADDR(addr) ((u64)(addr) & VTD_PAGE_MASK) #define QI_EIOTLB_IH(ih) (((u64)ih) << 6) -- 2.25.1
[PATCH 0/5] iommu/vt-d: Several misc fixes
Hi Joerg, This series includes some misc fixes for the VT-d iommu driver. Please help to review and merge. Best regards, baolu Lu Baolu (5): iommu/vt-d: Report the right page fault address iommu/vt-d: Remove WO permissions on second-level paging entries iommu/vt-d: Invalidate PASID cache when root/context entry changed iommu/vt-d: Use user privilege for RID2PASID translation iommu/vt-d: Avoid unnecessary cache flush in pasid entry teardown drivers/iommu/intel/iommu.c | 28 drivers/iommu/intel/pasid.c | 6 +- drivers/iommu/intel/svm.c | 2 +- include/linux/intel-iommu.h | 1 + 4 files changed, 23 insertions(+), 14 deletions(-) -- 2.25.1
[PATCH 1/1] iommu: Don't use lazy flush for untrusted device
The lazy IOTLB flushing setup leaves a time window, in which the device can still access some system memory, which has already been unmapped by the device driver. It's not suitable for untrusted devices. A malicious device might use this to attack the system by obtaining data that it shouldn't obtain. Fixes: c588072bba6b5 ("iommu/vt-d: Convert intel iommu driver to the iommu ops") Signed-off-by: Lu Baolu --- drivers/iommu/dma-iommu.c | 15 --- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index f659395e7959..65234e383d6b 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -311,6 +311,11 @@ static void iommu_dma_flush_iotlb_all(struct iova_domain *iovad) domain->ops->flush_iotlb_all(domain); } +static bool dev_is_untrusted(struct device *dev) +{ + return dev_is_pci(dev) && to_pci_dev(dev)->untrusted; +} + /** * iommu_dma_init_domain - Initialise a DMA mapping domain * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie() @@ -365,8 +370,9 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, init_iova_domain(iovad, 1UL << order, base_pfn); - if (!cookie->fq_domain && !iommu_domain_get_attr(domain, - DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE, &attr) && attr) { + if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) && + !iommu_domain_get_attr(domain, DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE, &attr) && + attr) { if (init_iova_flush_queue(iovad, iommu_dma_flush_iotlb_all, iommu_dma_entry_dtor)) pr_warn("iova flush queue initialization failed\n"); @@ -508,11 +514,6 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr, iova_align(iovad, size), dir, attrs); } -static bool dev_is_untrusted(struct device *dev) -{ - return dev_is_pci(dev) && to_pci_dev(dev)->untrusted; -} - static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys, size_t size, int prot, u64 dma_mask) { -- 2.25.1
Re: [PATCH 4/4] iommu/vt-d: Calculate and set flags for handle_mm_fault
On 2/19/21 5:31 AM, Jacob Pan wrote: Page requests are originated from the user page fault. Therefore, we shall set FAULT_FLAG_USER. FAULT_FLAG_REMOTE indicates that we are walking an mm which is not guaranteed to be the same as the current->mm and should not be subject to protection key enforcement. Therefore, we should set FAULT_FLAG_REMOTE to avoid faults when both SVM and PKEY are used. References: commit 1b2ee1266ea6 ("mm/core: Do not enforce PKEY permissions on remote mm access") Reviewed-by: Raj Ashok Signed-off-by: Jacob Pan Acked-by: Lu Baolu Best regards, baolu --- drivers/iommu/intel/svm.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index ff7ae7cc17d5..7bfd20a24a60 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -1086,6 +1086,7 @@ static irqreturn_t prq_event_thread(int irq, void *d) struct intel_iommu *iommu = d; struct intel_svm *svm = NULL; int head, tail, handled = 0; + unsigned int flags = 0; /* Clear PPR bit before reading head/tail registers, to * ensure that we get a new interrupt if needed. */ @@ -1186,9 +1187,11 @@ static irqreturn_t prq_event_thread(int irq, void *d) if (access_error(vma, req)) goto invalid; - ret = handle_mm_fault(vma, address, - req->wr_req ? FAULT_FLAG_WRITE : 0, - NULL); + flags = FAULT_FLAG_USER | FAULT_FLAG_REMOTE; + if (req->wr_req) + flags |= FAULT_FLAG_WRITE; + + ret = handle_mm_fault(vma, address, flags, NULL); if (ret & VM_FAULT_ERROR) goto invalid;
Re: [PATCH 3/4] iommu/vt-d: Reject unsupported page request modes
On 2/19/21 5:31 AM, Jacob Pan wrote: When supervisor/privilege mode SVM is used, we bind init_mm.pgd with a supervisor PASID. There should not be any page fault for init_mm. Execution request with DMA read is also not supported. This patch checks the PRQ descriptor for both unsupported configurations and rejects them both with invalid responses. Signed-off-by: Jacob Pan Fixes: 1c4f88b7f1f92 ("iommu/vt-d: Shared virtual address in scalable mode") Acked-by: Lu Baolu Best regards, baolu --- drivers/iommu/intel/svm.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index 23a1e4f58c54..ff7ae7cc17d5 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -1113,7 +1113,17 @@ static irqreturn_t prq_event_thread(int irq, void *d) ((unsigned long long *)req)[1]); goto no_pasid; } - + /* We shall not receive page request for supervisor SVM */ + if (req->pm_req && (req->rd_req | req->wr_req)) { + pr_err("Unexpected page request in Privilege Mode"); + /* No need to find the matching sdev as for bad_req */ + goto no_pasid; + } + /* DMA read with exec request is not supported. */ + if (req->exe_req && req->rd_req) { + pr_err("Execution request not supported\n"); + goto no_pasid; + } if (!svm || svm->pasid != req->pasid) { rcu_read_lock(); svm = ioasid_find(NULL, req->pasid, NULL);
Re: [PATCH 1/4] iommu/vt-d: Enable write protect for supervisor SVM
Hi Jacob and Sanjay, On 2/19/21 5:31 AM, Jacob Pan wrote: Write protect bit, when set, inhibits supervisor writes to the read-only pages. In supervisor shared virtual addressing (SVA), where page tables are shared between CPU and DMA, IOMMU PASID entry WPE bit should match CR0.WP bit in the CPU. This patch sets WPE bit for supervisor PASIDs if CR0.WP is set. From reading the commit message, the intention of this patch is to match the PASID entry WPE bit with CPU CR0.WP if 1) SRE is set (supervisor pasid); 2) page table is shared between CPU and IOMMU. Do I understand it right? But what the real code does is fail pasid entry setup for first level translation if CPU CR0.WP is not set. That's not consistent with what is described above. What I am thinking is that, as long as SRE is set, we should always set WPE in intel_pasid_setup_first_level(). For the supervisor SVA case, we should check CPU CR0.WP in intel_svm_bind_mm() and abort binding if CR0.WP is not set. Thoughts? Best regards, baolu Signed-off-by: Sanjay Kumar Signed-off-by: Jacob Pan --- drivers/iommu/intel/pasid.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 0cceaabc3ce6..0b7e0e726ade 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -410,6 +410,15 @@ static inline void pasid_set_sre(struct pasid_entry *pe) pasid_set_bits(&pe->val[2], 1 << 0, 1); } +/* + * Setup the WPE(Write Protect Enable) field (Bit 132) of a + * scalable mode PASID entry. + */ +static inline void pasid_set_wpe(struct pasid_entry *pe) +{ + pasid_set_bits(&pe->val[2], 1 << 4, 1 << 4); +} + /* * Setup the P(Present) field (Bit 0) of a scalable mode PASID * entry.
@@ -553,6 +562,20 @@ static void pasid_flush_caches(struct intel_iommu *iommu, } } +static inline int pasid_enable_wpe(struct pasid_entry *pte) +{ + unsigned long cr0 = read_cr0(); + + /* CR0.WP is normally set but just to be sure */ + if (unlikely(!(cr0 & X86_CR0_WP))) { + pr_err_ratelimited("No CPU write protect!\n"); + return -EINVAL; + } + pasid_set_wpe(pte); + + return 0; +}; + /* * Set up the scalable mode pasid table entry for first only * translation type. @@ -584,6 +607,9 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu, return -EINVAL; } pasid_set_sre(pte); + if (pasid_enable_wpe(pte)) + return -EINVAL; + } if (flags & PASID_FLAG_FL5LP) {
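The bit arithmetic in the patch is easy to check with a user-space model: the scalable-mode PASID entry is 512 bits wide, represented as eight 64-bit words, so overall bit 132 (WPE) is bit 4 of `val[2]` and bit 128 (SRE) is bit 0 of `val[2]`. The struct name is a stand-in; the read-modify-write helper follows the patch's pattern, simplified (no `WRITE_ONCE`).

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a 512-bit PASID entry as eight u64 words. */
struct pasid_entry_model {
	uint64_t val[8];
};

/* Clear the bits under mask, then set the requested bits. */
static void pasid_set_bits(uint64_t *ptr, uint64_t mask, uint64_t bits)
{
	uint64_t old = *ptr;

	*ptr = (old & ~mask) | (bits & mask);
}

/* SRE: overall bit 128 == bit 0 of val[2]. */
static void pasid_set_sre(struct pasid_entry_model *pe)
{
	pasid_set_bits(&pe->val[2], 1ULL << 0, 1);
}

/* WPE: overall bit 132 == bit 4 of val[2]. */
static void pasid_set_wpe(struct pasid_entry_model *pe)
{
	pasid_set_bits(&pe->val[2], 1ULL << 4, 1ULL << 4);
}
```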
Re: [PATCH 11/12] platform-msi: Add platform check for subdevice irq domain
Hi Leon, On 2/8/21 4:21 PM, Leon Romanovsky wrote: On Wed, Feb 03, 2021 at 12:56:44PM -0800, Megha Dey wrote: From: Lu Baolu The pci_subdevice_msi_create_irq_domain() should fail if the underlying platform is not able to support IMS (Interrupt Message Storage). Otherwise, the isolation of interrupt is not guaranteed. For x86, IMS is only supported on bare metal for now. We could enable it in the virtualization environments in the future if interrupt HYPERCALL domain is supported or the hardware has the capability of interrupt isolation for subdevices. Cc: David Woodhouse Cc: Leon Romanovsky Cc: Kevin Tian Suggested-by: Thomas Gleixner Link: https://lore.kernel.org/linux-pci/87pn4nk7nn@nanos.tec.linutronix.de/ Link: https://lore.kernel.org/linux-pci/877dqrnzr3@nanos.tec.linutronix.de/ Link: https://lore.kernel.org/linux-pci/877dqqmc2h@nanos.tec.linutronix.de/ Signed-off-by: Lu Baolu Signed-off-by: Megha Dey --- arch/x86/pci/common.c | 74 + drivers/base/platform-msi.c | 8 + include/linux/msi.h | 1 + 3 files changed, 83 insertions(+) diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c index 3507f45..263ccf6 100644 --- a/arch/x86/pci/common.c +++ b/arch/x86/pci/common.c @@ -12,6 +12,8 @@ #include #include #include +#include +#include #include #include @@ -724,3 +726,75 @@ struct pci_dev *pci_real_dma_dev(struct pci_dev *dev) return dev; } #endif + +#ifdef CONFIG_DEVICE_MSI Sorry for my naive question, but I see it in all your patches in this series and wonder why did you wrap everything with ifdefs?. The added code is only called when DEVICE_MSI is configured. All *.c code is wrapped with those ifdefs, which is hard to navigate and unlikely to give any code/size optimization benefit if kernel is compiled without CONFIG_DEVICE_MSI. The more common approach is to put those ifdef in the public header files and leave to the compiler to drop not called functions. Yes. This looks better. Thanks Best regards, baolu
Re: [PATCH 11/12] platform-msi: Add platform check for subdevice irq domain
Hi Jason, On 2021/2/4 20:14, Jason Gunthorpe wrote: On Wed, Feb 03, 2021 at 12:56:44PM -0800, Megha Dey wrote: +bool arch_support_pci_device_ims(struct pci_dev *pdev) +{ Consistent language please, we are not using IMS elsewhere, this feature is called device_msi in Linux. Thanks for pointing this out. I will correct it. Jason Best regards, baolu
Re: [PATCH v2 3/3] iommu/vt-d: Apply SATC policy
Hi Kevin, On 2/4/21 9:59 AM, Tian, Kevin wrote: From: Lu Baolu Sent: Wednesday, February 3, 2021 5:33 PM From: Yian Chen Starting from Intel VT-d v3.2, Intel platform BIOS can provide a new SATC table structure. SATC table lists a set of SoC integrated devices that require ATC to work (VT-d specification v3.2, section 8.8). Furthermore, This statement is not accurate. The purpose of SATC is to tell whether a SoC integrated device has been validated to meet the isolation requirements of using device TLB. All devices listed in SATC can have ATC safely enabled by the OS. In addition, there is a flag for each listed device indicating whether ATC is a functional requirement. However, the above description only captured the last point. You are right. This series only addresses the devices with the flag set, which have a functional requirement for ATS. the new version of IOMMU supports SoC device ATS in both its Scalable mode and legacy mode. When IOMMU is working in scalable mode, software must enable device ATS support. "must enable" is misleading here. You need to describe the policies for three categories: - SATC devices with ATC_REQUIRED=1 - SATC devices with ATC_REQUIRED=0 - devices not listed in SATC, or when SATC is missing Yian is working on this part. We planned it for v5.13. He will bring it up for discussion later. On the other hand, when IOMMU is in legacy mode for whatever reason, the hardware managed ATS will automatically take effect and the SATC required devices can work transparently to the software. As the result, software shouldn't enable ATS on that device, otherwise duplicate device TLB invalidations will occur. No background is given about hardware-managed ATS. This description draws an equation between legacy mode and hardware managed ATS. Do we care about the scenario where there is no hardware managed ATS but people also want to turn on ATC in legacy mode? The hardware managed ATS is defined in the platform specific specification.
The purpose of this hardware design is backward compatibility - legacy OSes with no SM or ATS awareness still running well on them. Thanks Kevin Best regards, baolu
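The policy being debated above reduces to one predicate over the three device categories Kevin lists. The function below is a user-space sketch of the condition in `dmar_insert_one_dev_info()`; the name and boolean parameters are illustrative, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Should software enable ATS for this device?
 * - Both the IOMMU and the device must support ATS at all.
 * - A SATC device with ATC_REQUIRED=1 relies on hardware-managed ATS in
 *   legacy mode, so software ATS is enabled only under scalable mode.
 * - Devices not required by SATC (or not listed) keep the old behavior. */
static bool ats_should_enable(bool iommu_ats, bool dev_ats,
			      bool atc_required, bool sm_enabled)
{
	if (!iommu_ats || !dev_ats)
		return false;
	return !atc_required || sm_enabled;
}
```

Enabling software ATS on an ATC_REQUIRED device in legacy mode would duplicate the invalidations already issued by the hardware-managed path, which is exactly what the commit message warns against.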
Re: [PATCH v2 0/3] iommu/vt-d: Add support for ACPI/SATC table
Add the change log. Sorry for the quick send. On 2021/2/3 17:33, Lu Baolu wrote: The Intel VT-d specification v3.2 comes with a new ACPI SATC (SoC- Integrated Address Translation Cache) reporting structure. The SoC integrated device enumerated in this table has a functional requirement to enable its ATC (Address Translation Cache) via the ATS capability for device operation. This patch set adds the code to parse the SATC table, enumerates the devices in it and satisfies the ATS requirement for them. Please help to review. I will queue it in VT-d update for v5.12 if no objection. Change log: v1->v2: - Rephrase some words in the cover letter, commit message and code comments. - Refactored a code style to make it look nicer. Best regards, baolu Yian Chen (3): iommu/vt-d: Add new enum value and structure for SATC iommu/vt-d: Parse SATC reporting structure iommu/vt-d: Apply SATC policy drivers/iommu/intel/dmar.c | 8 ++ drivers/iommu/intel/iommu.c | 162 +++- include/acpi/actbl1.h | 11 ++- include/linux/dmar.h| 2 + 4 files changed, 178 insertions(+), 5 deletions(-)
[PATCH v2 3/3] iommu/vt-d: Apply SATC policy
From: Yian Chen Starting from Intel VT-d v3.2, Intel platform BIOS can provide a new SATC table structure. SATC table lists a set of SoC integrated devices that require ATC to work (VT-d specification v3.2, section 8.8). Furthermore, the new version of IOMMU supports SoC device ATS in both its Scalable mode and legacy mode. When IOMMU is working in scalable mode, software must enable device ATS support. On the other hand, when IOMMU is in legacy mode for whatever reason, the hardware managed ATS will automatically take effect and the SATC required devices can work transparently to the software. As the result, software shouldn't enable ATS on that device, otherwise duplicate device TLB invalidations will occur. Signed-off-by: Yian Chen Signed-off-by: Lu Baolu --- drivers/iommu/intel/iommu.c | 73 +++-- 1 file changed, 69 insertions(+), 4 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index ee0932307d64..3e30c340e6a9 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -872,6 +872,60 @@ static bool iommu_is_dummy(struct intel_iommu *iommu, struct device *dev) return false; } +static bool iommu_support_ats(struct intel_iommu *iommu) +{ + return ecap_dev_iotlb_support(iommu->ecap); +} + +static bool device_support_ats(struct pci_dev *dev) +{ + return pci_ats_supported(dev) && dmar_find_matched_atsr_unit(dev); +} + +static int segment_atc_required(u16 segment) +{ + struct acpi_dmar_satc *satc; + struct dmar_satc_unit *satcu; + int ret = 1; + + rcu_read_lock(); + list_for_each_entry_rcu(satcu, &dmar_satc_units, list) { + satc = container_of(satcu->hdr, struct acpi_dmar_satc, header); + if (satcu->atc_required && satcu->devices_cnt && + satc->segment == segment) + goto out; + } + ret = 0; +out: + rcu_read_unlock(); + return ret; +} + +static int device_atc_required(struct pci_dev *dev) +{ + struct dmar_satc_unit *satcu; + struct acpi_dmar_satc *satc; + struct device *tmp; + int i, ret = 1; + + dev = pci_physfn(dev);
+ rcu_read_lock(); + list_for_each_entry_rcu(satcu, &dmar_satc_units, list) { + satc = container_of(satcu->hdr, struct acpi_dmar_satc, header); + if (!satcu->atc_required || + satc->segment != pci_domain_nr(dev->bus)) + continue; + + for_each_dev_scope(satcu->devices, satcu->devices_cnt, i, tmp) + if (to_pci_dev(tmp) == dev) + goto out; + } + ret = 0; +out: + rcu_read_unlock(); + return ret; +} + struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn) { struct dmar_drhd_unit *drhd = NULL; @@ -2555,10 +2609,16 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu, if (dev && dev_is_pci(dev)) { struct pci_dev *pdev = to_pci_dev(info->dev); - if (ecap_dev_iotlb_support(iommu->ecap) && - pci_ats_supported(pdev) && - dmar_find_matched_atsr_unit(pdev)) - info->ats_supported = 1; + /* +* Support ATS by default if it's supported by both IOMMU and +* client sides, except that the device's ATS is required by +* ACPI/SATC but the IOMMU scalable mode is disabled. In that +* case the hardware managed ATS will be automatically used. +*/ + if (iommu_support_ats(iommu) && device_support_ats(pdev)) { + if (!device_atc_required(pdev) || sm_supported(iommu)) + info->ats_supported = 1; + } if (sm_supported(iommu)) { if (pasid_supported(iommu)) { @@ -3155,6 +3215,11 @@ static int __init init_dmars(void) * endfor */ for_each_drhd_unit(drhd) { + if (pci_ats_disabled() && segment_atc_required(drhd->segment)) { + pr_warn("Scalable mode disabled -- Use hardware managed ATS because PCIe ATS is disabled but some devices in PCIe segment 0x%x require it.", + drhd->segment); + intel_iommu_sm = 0; + } /* * lock not needed as this is only incremented in the single * threaded kernel __init code path all other access are read -- 2.25.1
[PATCH v2 0/3] iommu/vt-d: Add support for ACPI/SATC table
The Intel VT-d specification v3.2 comes with a new ACPI SATC (SoC- Integrated Address Translation Cache) reporting structure. The SoC integrated device enumerated in this table has a functional requirement to enable its ATC (Address Translation Cache) via the ATS capability for device operation. This patch set adds the code to parse the SATC table, enumerates the devices in it and satisfies the ATS requirement for them. Please help to review. I will queue it in VT-d update for v5.12 if no objection. Yian Chen (3): iommu/vt-d: Add new enum value and structure for SATC iommu/vt-d: Parse SATC reporting structure iommu/vt-d: Apply SATC policy drivers/iommu/intel/dmar.c | 8 ++ drivers/iommu/intel/iommu.c | 162 +++- include/acpi/actbl1.h | 11 ++- include/linux/dmar.h| 2 + 4 files changed, 178 insertions(+), 5 deletions(-) -- 2.25.1
[PATCH v2 1/3] iommu/vt-d: Add new enum value and structure for SATC
From: Yian Chen Starting from Intel Platform VT-d v3.2, BIOS may provide new remapping structure SATC for SOC integrated devices, according to section 8.8 of Intel VT-d architecture specification v3.2. The SATC structure reports a list of the devices that require ATS for normal device operation. It is a functional requirement that these devices will not work without OS enabling ATS capability. This patch introduces the new enum value and structure to represent the remapping information. Kernel should parse the information from the reporting structure and enable ATC for the devices as needed. Signed-off-by: Yian Chen Signed-off-by: Lu Baolu --- include/acpi/actbl1.h | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h index 43549547ed3e..c38e08cf1b9e 100644 --- a/include/acpi/actbl1.h +++ b/include/acpi/actbl1.h @@ -514,7 +514,8 @@ enum acpi_dmar_type { ACPI_DMAR_TYPE_ROOT_ATS = 2, ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3, ACPI_DMAR_TYPE_NAMESPACE = 4, - ACPI_DMAR_TYPE_RESERVED = 5 /* 5 and greater are reserved */ + ACPI_DMAR_TYPE_SATC = 5, + ACPI_DMAR_TYPE_RESERVED = 6 /* 6 and greater are reserved */ }; /* DMAR Device Scope structure */ @@ -607,6 +608,14 @@ struct acpi_dmar_andd { char device_name[1]; }; +/* 5: SOC Integrated Address Translation Cache Reporting Structure */ + +struct acpi_dmar_satc { + struct acpi_dmar_header header; + u8 flags; + u8 reserved; + u16 segment; +}; /*** * * DRTM - Dynamic Root of Trust for Measurement table -- 2.25.1
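The new table entry has a fixed 8-byte layout before the device-scope entries. Below is a user-space layout sketch with illustrative type names; the field widths follow the patch, and bit 0 of `flags` carries the ATC-required indication.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the common DMAR subtable header (type, length). */
struct dmar_header_model {
	uint16_t type;
	uint16_t length;
};

/* Model of struct acpi_dmar_satc: header, flags, reserved, segment.
 * Device-scope entries follow immediately after this 8-byte prefix. */
struct dmar_satc_model {
	struct dmar_header_model header;
	uint8_t  flags;    /* bit 0: ATC required for device operation */
	uint8_t  reserved;
	uint16_t segment;  /* PCI segment of the listed devices */
};

/* Extract the ATC_REQUIRED flag, as done when building the unit list. */
static int satc_atc_required(uint8_t flags)
{
	return flags & 0x1;
}
```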
[PATCH v2 2/3] iommu/vt-d: Parse SATC reporting structure
From: Yian Chen Software should parse every SATC table and all devices in the tables reported by the BIOS and keep the information in kernel list for further reference. Signed-off-by: Yian Chen Signed-off-by: Lu Baolu --- drivers/iommu/intel/dmar.c | 8 drivers/iommu/intel/iommu.c | 89 + include/linux/dmar.h| 2 + 3 files changed, 99 insertions(+) diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c index 980e8ae090af..d5c51b5c20af 100644 --- a/drivers/iommu/intel/dmar.c +++ b/drivers/iommu/intel/dmar.c @@ -526,6 +526,7 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header *header) struct acpi_dmar_reserved_memory *rmrr; struct acpi_dmar_atsr *atsr; struct acpi_dmar_rhsa *rhsa; + struct acpi_dmar_satc *satc; switch (header->type) { case ACPI_DMAR_TYPE_HARDWARE_UNIT: @@ -555,6 +556,10 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header *header) /* We don't print this here because we need to sanity-check it first. So print it in dmar_parse_one_andd() instead. */ break; + case ACPI_DMAR_TYPE_SATC: + satc = container_of(header, struct acpi_dmar_satc, header); + pr_info("SATC flags: 0x%x\n", satc->flags); + break; } } @@ -642,6 +647,7 @@ parse_dmar_table(void) .cb[ACPI_DMAR_TYPE_ROOT_ATS] = _parse_one_atsr, .cb[ACPI_DMAR_TYPE_HARDWARE_AFFINITY] = _parse_one_rhsa, .cb[ACPI_DMAR_TYPE_NAMESPACE] = _parse_one_andd, + .cb[ACPI_DMAR_TYPE_SATC] = _parse_one_satc, }; /* @@ -2077,6 +2083,7 @@ static guid_t dmar_hp_guid = #defineDMAR_DSM_FUNC_DRHD 1 #defineDMAR_DSM_FUNC_ATSR 2 #defineDMAR_DSM_FUNC_RHSA 3 +#defineDMAR_DSM_FUNC_SATC 4 static inline bool dmar_detect_dsm(acpi_handle handle, int func) { @@ -2094,6 +2101,7 @@ static int dmar_walk_dsm_resource(acpi_handle handle, int func, [DMAR_DSM_FUNC_DRHD] = ACPI_DMAR_TYPE_HARDWARE_UNIT, [DMAR_DSM_FUNC_ATSR] = ACPI_DMAR_TYPE_ROOT_ATS, [DMAR_DSM_FUNC_RHSA] = ACPI_DMAR_TYPE_HARDWARE_AFFINITY, + [DMAR_DSM_FUNC_SATC] = ACPI_DMAR_TYPE_SATC, }; if (!dmar_detect_dsm(handle, func)) diff --git a/drivers/iommu/intel/iommu.c 
b/drivers/iommu/intel/iommu.c index ecbd05d8a1fc..ee0932307d64 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -316,8 +316,18 @@ struct dmar_atsr_unit { u8 include_all:1; /* include all ports */ }; +struct dmar_satc_unit { + struct list_head list; /* list of SATC units */ + struct acpi_dmar_header *hdr; /* ACPI header */ + struct dmar_dev_scope *devices; /* target devices */ + struct intel_iommu *iommu; /* the corresponding iommu */ + int devices_cnt;/* target device count */ + u8 atc_required:1; /* ATS is required */ +}; + static LIST_HEAD(dmar_atsr_units); static LIST_HEAD(dmar_rmrr_units); +static LIST_HEAD(dmar_satc_units); #define for_each_rmrr_units(rmrr) \ list_for_each_entry(rmrr, &dmar_rmrr_units, list) @@ -3716,6 +3726,57 @@ int dmar_check_one_atsr(struct acpi_dmar_header *hdr, void *arg) return 0; } +static struct dmar_satc_unit *dmar_find_satc(struct acpi_dmar_satc *satc) +{ + struct dmar_satc_unit *satcu; + struct acpi_dmar_satc *tmp; + + list_for_each_entry_rcu(satcu, &dmar_satc_units, list, + dmar_rcu_check()) { + tmp = (struct acpi_dmar_satc *)satcu->hdr; + if (satc->segment != tmp->segment) + continue; + if (satc->header.length != tmp->header.length) + continue; + if (memcmp(satc, tmp, satc->header.length) == 0) + return satcu; + } + + return NULL; +} + +int dmar_parse_one_satc(struct acpi_dmar_header *hdr, void *arg) +{ + struct acpi_dmar_satc *satc; + struct dmar_satc_unit *satcu; + + if (system_state >= SYSTEM_RUNNING && !intel_iommu_enabled) + return 0; + + satc = container_of(hdr, struct acpi_dmar_satc, header); + satcu = dmar_find_satc(satc); + if (satcu) + return 0; + + satcu = kzalloc(sizeof(*satcu) + hdr->length, GFP_KERNEL); + if (!satcu) + return -ENOMEM; + + satcu->hdr = (void *)(satcu + 1); + memcpy(satcu->hdr, hdr, hdr->length); + satcu->atc_required = satc->flags & 0x1; + satcu->devices = dmar_alloc_dev_scope((void *)(satc + 1), + (void *)satc + satc->header.length, +
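`dmar_parse_one_satc()` above uses a common kernel allocation pattern: one buffer holds both the bookkeeping struct and a private copy of the ACPI entry, with `hdr` pointing just past the struct. A user-space sketch of the pattern, with hypothetical type and function names:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct hdr_model {
	uint16_t type;
	uint16_t length;
};

struct unit_model {
	struct hdr_model *hdr; /* points into the same allocation */
	uint8_t atc_required;
};

/* Allocate struct + trailing copy of the variable-length entry in one go,
 * mirroring: satcu = kzalloc(sizeof(*satcu) + hdr->length, GFP_KERNEL);
 *            satcu->hdr = (void *)(satcu + 1);
 *            memcpy(satcu->hdr, hdr, hdr->length); */
static struct unit_model *unit_alloc(const struct hdr_model *hdr, uint8_t flags)
{
	struct unit_model *u = calloc(1, sizeof(*u) + hdr->length);

	if (!u)
		return NULL;
	u->hdr = (struct hdr_model *)(u + 1); /* trailing copy */
	memcpy(u->hdr, hdr, hdr->length);
	u->atc_required = flags & 0x1;
	return u;
}
```

One allocation means one `kfree()` on teardown and no dangling pointer into the firmware-provided table after hotplug.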
Re: [PATCH 0/3] iommu/vt-d: Add support for ACPI/SATC table
Hi Christoph, On 2021/2/3 16:41, Christoph Hellwig wrote: On Tue, Feb 02, 2021 at 12:40:54PM +0800, Lu Baolu wrote: Intel platform VT-d (v3.2) comes with a new type of DMAR subtable SATC. The SATC table includes a list of SoC integrated devices that require SATC. OS software can use this table to enable ATS only for the devices in the list. This all sounds like gibberish. Please expand and if necessary explain acronyms when first used in each commit log / cover letter. I will rephrase the words. Best regards, baolu
Re: [PATCH 3/3] iommu/vt-d: Apply SATC policy
On 2/2/21 9:55 PM, Joerg Roedel wrote: On Tue, Feb 02, 2021 at 12:40:57PM +0800, Lu Baolu wrote: + list_for_each_entry_rcu(satcu, &dmar_satc_units, list) { + satc = container_of(satcu->hdr, struct acpi_dmar_satc, header); + if (satc->segment == pci_domain_nr(dev->bus) && satcu->atc_required) { You can save a level of indentation and make this look nicer if you do: if (satc->segment != pci_domain_nr(dev->bus) || !satcu->atc_required) continue; Yes. Thanks! Best regards, baolu + for_each_dev_scope(satcu->devices, satcu->devices_cnt, i, tmp) + if (to_pci_dev(tmp) == dev) + goto out; + } + } + ret = 0; +out: + rcu_read_unlock(); + return ret; +}
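Joerg's suggestion is the usual inverted-condition/`continue` refactoring, which the v2 patch adopted. The two loops below are a user-space sketch showing both forms of the lookup over an array instead of an RCU list; names and data are illustrative.

```c
#include <assert.h>
#include <stddef.h>

struct satc_model {
	int segment;
	int atc_required;
};

/* Original form: positive condition, body nested one level deeper. */
static int find_nested(const struct satc_model *v, size_t n, int seg)
{
	for (size_t i = 0; i < n; i++) {
		if (v[i].segment == seg && v[i].atc_required)
			return (int)i;
	}
	return -1;
}

/* Suggested form: invert the condition and continue, saving a level of
 * indentation once the body grows (as it does with for_each_dev_scope()). */
static int find_continue(const struct satc_model *v, size_t n, int seg)
{
	for (size_t i = 0; i < n; i++) {
		if (v[i].segment != seg || !v[i].atc_required)
			continue;
		return (int)i;
	}
	return -1;
}
```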
Re: [PATCH 2/3] iommu/vt-d: Parse SATC reporting structure
Hi Ashok, On 2/3/21 12:41 AM, Raj, Ashok wrote: On Tue, Feb 02, 2021 at 12:40:56PM +0800, Lu Baolu wrote: From: Yian Chen Software should parse every SATC table and all devices in the tables reported by the BIOS and keep the information in a kernel list for further SATC policy deployment. The last part seems a bit vague. Are you trying to imply that, if the kernel is booted with noats for instance, a device listed in the SATC table as "requires ATS" will fail to load? Otherwise it's not clear what the policy enforcement means. Yes. This is a bit vague. The ATS policy is out of the scope of this patch. It only parses the table and keeps the device list for further reference. Best regards, baolu
Re: [PATCH 2/3] iommu/vt-d: Parse SATC reporting structure
Hi Joerg, On 2/2/21 9:53 PM, Joerg Roedel wrote: On Tue, Feb 02, 2021 at 12:40:56PM +0800, Lu Baolu wrote: + case ACPI_DMAR_TYPE_SATC: + satc = container_of(header, struct acpi_dmar_satc, header); + pr_info("SATC flags: 0x%x\n", satc->flags); + break; Did the pr_info() slip through or is there a real purpose for having this information in the kernel log? This is just the easiest way for users to learn about SATC from the log: does the system have SATC? Is ATS required? Best regards, baolu
Re: [PATCH 1/3] iommu/vt-d: Add new enum value and structure for SATC
Hi Ashok, On 2/3/21 12:02 AM, Raj, Ashok wrote: On Tue, Feb 02, 2021 at 12:40:55PM +0800, Lu Baolu wrote: From: Yian Chen Starting from Intel Platform VT-d v3.2, BIOS may provide new remapping structure SATC for SOC integrated devices, according to section 8.8 of Intel VT-d architecture specification v3.2. The SATC structure reports a list of the devices that require SATC enabling via ATS capacity. nit: s/require SATC/require ATS for normal device operation. This is a functional requirement that these devices will not work without OS enabling ATS capability. Yes. This looks clearer. Best regards, baolu This patch introduces the new enum value and structure to represent the remapping information. Kernel should parse the information from the reporting structure and enable ATC for the devices as needed. Signed-off-by: Yian Chen --- include/acpi/actbl1.h | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h index 43549547ed3e..b7ca802b66d2 100644 --- a/include/acpi/actbl1.h +++ b/include/acpi/actbl1.h @@ -514,7 +514,8 @@ enum acpi_dmar_type { ACPI_DMAR_TYPE_ROOT_ATS = 2, ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3, ACPI_DMAR_TYPE_NAMESPACE = 4, - ACPI_DMAR_TYPE_RESERVED = 5 /* 5 and greater are reserved */ + ACPI_DMAR_TYPE_SATC = 5, + ACPI_DMAR_TYPE_RESERVED = 6 /* 5 and greater are reserved */ }; Think Joerg spotted the comment update.
Re: [PATCH 1/3] iommu/vt-d: Add new enum value and structure for SATC
Hi Joerg, On 2/2/21 9:51 PM, Joerg Roedel wrote: Hi Baolu, On Tue, Feb 02, 2021 at 12:40:55PM +0800, Lu Baolu wrote: @@ -514,7 +514,8 @@ enum acpi_dmar_type { ACPI_DMAR_TYPE_ROOT_ATS = 2, ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3, ACPI_DMAR_TYPE_NAMESPACE = 4, - ACPI_DMAR_TYPE_RESERVED = 5 /* 5 and greater are reserved */ + ACPI_DMAR_TYPE_SATC = 5, + ACPI_DMAR_TYPE_RESERVED = 6 /* 5 and greater are reserved */ Nit: The comment needs to be updated too. Yes. Will update it. Best regards, baolu
[PATCH 3/3] iommu/vt-d: Apply SATC policy
From: Yian Chen Starting from Intel VT-d v3.2, Intel platform BIOS can provide a new SATC table structure. SATC table lists a set of SoC integrated devices that require ATC to work (VT-d specification v3.2, section 8.8). Furthermore, the new version of IOMMU supports SoC device ATS in both its Scalable mode and legacy mode. When IOMMU is working in scalable mode, software must enable device ATS support. On the other hand, when IOMMU is in legacy mode for whatever reason, the hardware managed ATS will automatically take effect and the SATC required devices can work transparently to the software. As the result, software shouldn't enable ATS on that device, otherwise duplicate device TLB invalidations will occur. Signed-off-by: Yian Chen --- drivers/iommu/intel/iommu.c | 72 ++--- 1 file changed, 68 insertions(+), 4 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index b20fd305fc60..131fac718a4b 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -872,6 +872,59 @@ static bool iommu_is_dummy(struct intel_iommu *iommu, struct device *dev) return false; } +static bool iommu_support_ats(struct intel_iommu *iommu) +{ + return ecap_dev_iotlb_support(iommu->ecap); +} + +static bool device_support_ats(struct pci_dev *dev) +{ + return pci_ats_supported(dev) && dmar_find_matched_atsr_unit(dev); +} + +static int segment_atc_required(u16 segment) +{ + struct acpi_dmar_satc *satc; + struct dmar_satc_unit *satcu; + int ret = 1; + + rcu_read_lock(); + list_for_each_entry_rcu(satcu, &dmar_satc_units, list) { + satc = container_of(satcu->hdr, struct acpi_dmar_satc, header); + if (satcu->atc_required && + satcu->devices_cnt && + satc->segment == segment) + goto out; + } + ret = 0; +out: + rcu_read_unlock(); + return ret; +} + +static int device_atc_required(struct pci_dev *dev) +{ + struct dmar_satc_unit *satcu; + struct acpi_dmar_satc *satc; + struct device *tmp; + int i, ret = 1; + + dev = pci_physfn(dev); + rcu_read_lock(); +
list_for_each_entry_rcu(satcu, &dmar_satc_units, list) { + satc = container_of(satcu->hdr, struct acpi_dmar_satc, header); + if (satc->segment == pci_domain_nr(dev->bus) && satcu->atc_required) { + for_each_dev_scope(satcu->devices, satcu->devices_cnt, i, tmp) + if (to_pci_dev(tmp) == dev) + goto out; + } + } + ret = 0; +out: + rcu_read_unlock(); + return ret; +} + struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn) { struct dmar_drhd_unit *drhd = NULL; @@ -2575,10 +2628,16 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu, if (dev && dev_is_pci(dev)) { struct pci_dev *pdev = to_pci_dev(info->dev); - if (ecap_dev_iotlb_support(iommu->ecap) && - pci_ats_supported(pdev) && - dmar_find_matched_atsr_unit(pdev)) - info->ats_supported = 1; + /* +* Support ATS by default if it's supported by both IOMMU and +* client sides, except that the device's ATS is required by +* ACPI/SATC but the IOMMU scalable mode is disabled. In that +* case the hardware managed ATS will be automatically used. +*/ + if (iommu_support_ats(iommu) && device_support_ats(pdev)) { + if (!device_atc_required(pdev) || sm_supported(iommu)) + info->ats_supported = 1; + } if (sm_supported(iommu)) { if (pasid_supported(iommu)) { @@ -3175,6 +3234,11 @@ static int __init init_dmars(void) * endfor */ for_each_drhd_unit(drhd) { + if (pci_ats_disabled() && segment_atc_required(drhd->segment)) { + pr_warn("Scalable mode disabled -- Use hardware managed ATS because PCIe ATS is disabled but some devices in PCIe segment 0x%x require it.", + drhd->segment); + intel_iommu_sm = 0; + } /* * lock not needed as this is only incremented in the single * threaded kernel __init code path all other access are read -- 2.25.1
[PATCH 2/3] iommu/vt-d: Parse SATC reporting structure
From: Yian Chen Software should parse every SATC table and all devices in the tables reported by the BIOS and keep the information in kernel list for further SATC policy deployment. Signed-off-by: Yian Chen --- drivers/iommu/intel/dmar.c | 9 drivers/iommu/intel/iommu.c | 89 + include/linux/dmar.h| 2 + 3 files changed, 100 insertions(+) diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c index 980e8ae090af..6bd357966d4e 100644 --- a/drivers/iommu/intel/dmar.c +++ b/drivers/iommu/intel/dmar.c @@ -526,6 +526,7 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header *header) struct acpi_dmar_reserved_memory *rmrr; struct acpi_dmar_atsr *atsr; struct acpi_dmar_rhsa *rhsa; + struct acpi_dmar_satc *satc; switch (header->type) { case ACPI_DMAR_TYPE_HARDWARE_UNIT: @@ -555,6 +556,11 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header *header) /* We don't print this here because we need to sanity-check it first. So print it in dmar_parse_one_andd() instead. */ break; + case ACPI_DMAR_TYPE_SATC: + satc = container_of(header, struct acpi_dmar_satc, header); + pr_info("SATC flags: 0x%x\n", satc->flags); + break; + } } @@ -642,6 +648,7 @@ parse_dmar_table(void) .cb[ACPI_DMAR_TYPE_ROOT_ATS] = _parse_one_atsr, .cb[ACPI_DMAR_TYPE_HARDWARE_AFFINITY] = _parse_one_rhsa, .cb[ACPI_DMAR_TYPE_NAMESPACE] = _parse_one_andd, + .cb[ACPI_DMAR_TYPE_SATC] = _parse_one_satc, }; /* @@ -2077,6 +2084,7 @@ static guid_t dmar_hp_guid = #defineDMAR_DSM_FUNC_DRHD 1 #defineDMAR_DSM_FUNC_ATSR 2 #defineDMAR_DSM_FUNC_RHSA 3 +#defineDMAR_DSM_FUNC_SATC 4 static inline bool dmar_detect_dsm(acpi_handle handle, int func) { @@ -2094,6 +2102,7 @@ static int dmar_walk_dsm_resource(acpi_handle handle, int func, [DMAR_DSM_FUNC_DRHD] = ACPI_DMAR_TYPE_HARDWARE_UNIT, [DMAR_DSM_FUNC_ATSR] = ACPI_DMAR_TYPE_ROOT_ATS, [DMAR_DSM_FUNC_RHSA] = ACPI_DMAR_TYPE_HARDWARE_AFFINITY, + [DMAR_DSM_FUNC_SATC] = ACPI_DMAR_TYPE_SATC, }; if (!dmar_detect_dsm(handle, func)) diff --git a/drivers/iommu/intel/iommu.c 
b/drivers/iommu/intel/iommu.c index ecd36e456de3..b20fd305fc60 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -316,8 +316,18 @@ struct dmar_atsr_unit { u8 include_all:1; /* include all ports */ }; +struct dmar_satc_unit { + struct list_head list; /* list of SATC units */ + struct acpi_dmar_header *hdr; /* ACPI header */ + struct dmar_dev_scope *devices; /* target devices */ + struct intel_iommu *iommu; /* the corresponding iommu */ + int devices_cnt;/* target device count */ + u8 atc_required:1; /* ATS is required */ +}; + static LIST_HEAD(dmar_atsr_units); static LIST_HEAD(dmar_rmrr_units); +static LIST_HEAD(dmar_satc_units); #define for_each_rmrr_units(rmrr) \ list_for_each_entry(rmrr, &dmar_rmrr_units, list) @@ -3736,6 +3746,57 @@ int dmar_check_one_atsr(struct acpi_dmar_header *hdr, void *arg) return 0; } +static struct dmar_satc_unit *dmar_find_satc(struct acpi_dmar_satc *satc) +{ + struct dmar_satc_unit *satcu; + struct acpi_dmar_satc *tmp; + + list_for_each_entry_rcu(satcu, &dmar_satc_units, list, + dmar_rcu_check()) { + tmp = (struct acpi_dmar_satc *)satcu->hdr; + if (satc->segment != tmp->segment) + continue; + if (satc->header.length != tmp->header.length) + continue; + if (memcmp(satc, tmp, satc->header.length) == 0) + return satcu; + } + + return NULL; +} + +int dmar_parse_one_satc(struct acpi_dmar_header *hdr, void *arg) +{ + struct acpi_dmar_satc *satc; + struct dmar_satc_unit *satcu; + + if (system_state >= SYSTEM_RUNNING && !intel_iommu_enabled) + return 0; + + satc = container_of(hdr, struct acpi_dmar_satc, header); + satcu = dmar_find_satc(satc); + if (satcu) + return 0; + + satcu = kzalloc(sizeof(*satcu) + hdr->length, GFP_KERNEL); + if (!satcu) + return -ENOMEM; + + satcu->hdr = (void *)(satcu + 1); + memcpy(satcu->hdr, hdr, hdr->length); + satcu->atc_required = satc->flags & 0x1; + satcu->devices = dmar_alloc_dev_scope((void *)(satc + 1), + (void *)satc + satc->header.length, + &satcu->devices_cnt); + if (satcu->devices_cnt &&
!satcu->devices)
[PATCH 1/3] iommu/vt-d: Add new enum value and structure for SATC
From: Yian Chen Starting from Intel Platform VT-d v3.2, BIOS may provide new remapping structure SATC for SOC integrated devices, according to section 8.8 of Intel VT-d architecture specification v3.2. The SATC structure reports a list of the devices that require SATC enabling via ATS capacity. This patch introduces the new enum value and structure to represent the remapping information. Kernel should parse the information from the reporting structure and enable ATC for the devices as needed. Signed-off-by: Yian Chen --- include/acpi/actbl1.h | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h index 43549547ed3e..b7ca802b66d2 100644 --- a/include/acpi/actbl1.h +++ b/include/acpi/actbl1.h @@ -514,7 +514,8 @@ enum acpi_dmar_type { ACPI_DMAR_TYPE_ROOT_ATS = 2, ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3, ACPI_DMAR_TYPE_NAMESPACE = 4, - ACPI_DMAR_TYPE_RESERVED = 5 /* 5 and greater are reserved */ + ACPI_DMAR_TYPE_SATC = 5, + ACPI_DMAR_TYPE_RESERVED = 6 /* 5 and greater are reserved */ }; /* DMAR Device Scope structure */ @@ -607,6 +608,14 @@ struct acpi_dmar_andd { char device_name[1]; }; +/* 5: SOC Integrated Address Translation Cache Reporting Structure */ + +struct acpi_dmar_satc { + struct acpi_dmar_header header; + u8 flags; + u8 reserved; + u16 segment; +}; /*** * * DRTM - Dynamic Root of Trust for Measurement table -- 2.25.1
[PATCH 0/3] iommu/vt-d: Add support for ACPI/SATC table
Intel platform VT-d (v3.2) comes with a new type of DMAR subtable SATC. The SATC table includes a list of SoC integrated devices that require SATC. OS software can use this table to enable ATS only for the devices in the list. Yian Chen (3): iommu/vt-d: Add new enum value and structure for SATC iommu/vt-d: Parse SATC reporting structure iommu/vt-d: Apply SATC policy drivers/iommu/intel/dmar.c | 9 ++ drivers/iommu/intel/iommu.c | 161 +++- include/acpi/actbl1.h | 11 ++- include/linux/dmar.h| 2 + 4 files changed, 178 insertions(+), 5 deletions(-) -- 2.25.1
[PATCH 1/1] iommu/vt-d: Fix compile error [-Werror=implicit-function-declaration]
trace_qi_submit() could be used when interrupt remapping is supported, but DMA remapping is not. In this case, the following compile error occurs. ../drivers/iommu/intel/dmar.c: In function 'qi_submit_sync': ../drivers/iommu/intel/dmar.c:1311:3: error: implicit declaration of function 'trace_qi_submit'; did you mean 'ftrace_nmi_exit'? [-Werror=implicit-function-declaration] trace_qi_submit(iommu, desc[i].qw0, desc[i].qw1, ^~~ ftrace_nmi_exit Fixes: f2dd871799ba5 ("iommu/vt-d: Add qi_submit trace event") Reported-by: kernel test robot Reported-by: Randy Dunlap Signed-off-by: Lu Baolu --- drivers/iommu/intel/Makefile | 2 +- drivers/iommu/intel/iommu.c| 1 - include/trace/events/intel_iommu.h | 2 -- 3 files changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile index fb8e1e8c8029..ae570810a35e 100644 --- a/drivers/iommu/intel/Makefile +++ b/drivers/iommu/intel/Makefile @@ -1,7 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_DMAR_TABLE) += dmar.o obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o -obj-$(CONFIG_INTEL_IOMMU) += trace.o +obj-$(CONFIG_DMAR_TABLE) += trace.o obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o obj-$(CONFIG_INTEL_IOMMU_SVM) += svm.o obj-$(CONFIG_IRQ_REMAP) += irq_remapping.o diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index d7fecc109947..37da4caa67c9 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -44,7 +44,6 @@ #include #include #include -#include #include "../irq_remapping.h" #include "pasid.h" diff --git a/include/trace/events/intel_iommu.h b/include/trace/events/intel_iommu.h index aad2ff0c1e2e..e801f4910522 100644 --- a/include/trace/events/intel_iommu.h +++ b/include/trace/events/intel_iommu.h @@ -6,7 +6,6 @@ * * Author: Lu Baolu */ -#ifdef CONFIG_INTEL_IOMMU #undef TRACE_SYSTEM #define TRACE_SYSTEM intel_iommu @@ -176,4 +175,3 @@ TRACE_EVENT(qi_submit, /* This part must be outside protection */ #include -#endif /* 
CONFIG_INTEL_IOMMU */ -- 2.25.1
Re: [PATCH v3] iommu/vt-d: do not use flush-queue when caching-mode is on
On 1/28/21 1:53 AM, Nadav Amit wrote: From: Nadav Amit When an Intel IOMMU is virtualized, and a physical device is passed-through to the VM, changes of the virtual IOMMU need to be propagated to the physical IOMMU. The hypervisor therefore needs to monitor PTE mappings in the IOMMU page-tables. Intel specifications provide "caching-mode" capability that a virtual IOMMU uses to report that the IOMMU is virtualized and a TLB flush is needed after mapping to allow the hypervisor to propagate virtual IOMMU mappings to the physical IOMMU. To the best of my knowledge no real physical IOMMU reports "caching-mode" as turned on. Synchronizing the virtual and the physical IOMMU tables is expensive if the hypervisor is unaware which PTEs have changed, as the hypervisor is required to walk all the virtualized tables and look for changes. Consequently, domain flushes are much more expensive than page-specific flushes on virtualized IOMMUs with passthrough devices. The kernel therefore exploited the "caching-mode" indication to avoid domain flushing and use page-specific flushing in virtualized environments. See commit 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching mode.") This behavior changed after commit 13cf01744608 ("iommu/vt-d: Make use of iova deferred flushing"). Now, when batched TLB flushing is used (the default), full TLB domain flushes are performed frequently, requiring the hypervisor to perform expensive synchronization between the virtual TLB and the physical one. Getting batched TLB flushes to use page-specific invalidations again in such circumstances is not easy, since the TLB invalidation scheme assumes that "full" domain TLB flushes are performed for scalability. Disable batched TLB flushes when caching-mode is on, as the performance benefit from using batched TLB invalidations is likely to be much smaller than the overhead of the virtual-to-physical IOMMU page-tables synchronization. 
Fixes: 13cf01744608 ("iommu/vt-d: Make use of iova deferred flushing") Signed-off-by: Nadav Amit Cc: David Woodhouse Cc: Lu Baolu Cc: Joerg Roedel Cc: Will Deacon Cc: sta...@vger.kernel.org --- v2->v3: * Fix the fixes tag in the commit-log (Lu). * Minor English rephrasing of the commit-log. v1->v2: * disable flush queue for all domains if caching-mode is on for any IOMMU (Lu). --- drivers/iommu/intel/iommu.c | 32 +++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 788119c5b021..de3dd617cf60 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -5373,6 +5373,36 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain, return ret; } +static bool domain_use_flush_queue(void) +{ + struct dmar_drhd_unit *drhd; + struct intel_iommu *iommu; + bool r = true; + + if (intel_iommu_strict) + return false; + + /* +* The flush queue implementation does not perform page-selective +* invalidations that are required for efficient TLB flushes in virtual +* environments. The benefit of batching is likely to be much lower than +* the overhead of synchronizing the virtual and physical IOMMU +* page-tables. +*/ + rcu_read_lock(); + for_each_active_iommu(iommu, drhd) { + if (!cap_caching_mode(iommu->cap)) + continue; + + pr_warn_once("IOMMU batching is disabled due to virtualization"); + r = false; + break; + } + rcu_read_unlock(); + + return r; +} + static int intel_iommu_domain_get_attr(struct iommu_domain *domain, enum iommu_attr attr, void *data) @@ -5383,7 +5413,7 @@ intel_iommu_domain_get_attr(struct iommu_domain *domain, case IOMMU_DOMAIN_DMA: switch (attr) { case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE: - *(int *)data = !intel_iommu_strict; + *(int *)data = domain_use_flush_queue(); return 0; default: return -ENODEV; Acked-by: Lu Baolu Best regards, baolu
Re: [PATCH v2] iommu/vt-d: do not use flush-queue when caching-mode is on
On 2021/1/27 14:17, Nadav Amit wrote: From: Nadav Amit When an Intel IOMMU is virtualized, and a physical device is passed-through to the VM, changes of the virtual IOMMU need to be propagated to the physical IOMMU. The hypervisor therefore needs to monitor PTE mappings in the IOMMU page-tables. Intel specifications provide "caching-mode" capability that a virtual IOMMU uses to report that the IOMMU is virtualized and a TLB flush is needed after mapping to allow the hypervisor to propagate virtual IOMMU mappings to the physical IOMMU. To the best of my knowledge no real physical IOMMU reports "caching-mode" as turned on. Synchronizing the virtual and the physical IOMMU tables is expensive if the hypervisor is unaware which PTEs have changed, as the hypervisor is required to walk all the virtualized tables and look for changes. Consequently, domain flushes are much more expensive than page-specific flushes on virtualized IOMMUs with passthrough devices. The kernel therefore exploited the "caching-mode" indication to avoid domain flushing and use page-specific flushing in virtualized environments. See commit 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching mode.") This behavior changed after commit 13cf01744608 ("iommu/vt-d: Make use of iova deferred flushing"). Now, when batched TLB flushing is used (the default), full TLB domain flushes are performed frequently, requiring the hypervisor to perform expensive synchronization between the virtual TLB and the physical one. Getting batched TLB flushes to use in such circumstances page-specific invalidations again is not easy, since the TLB invalidation scheme assumes that "full" domain TLB flushes are performed for scalability. Disable batched TLB flushes when caching-mode is on, as the performance benefit from using batched TLB invalidations is likely to be much smaller than the overhead of the virtual-to-physical IOMMU page-tables synchronization. 
Fixes: 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching mode.") Isn't it Fixes: 13cf01744608 ("iommu/vt-d: Make use of iova deferred flushing") ? Best regards, baolu Signed-off-by: Nadav Amit Cc: David Woodhouse Cc: Lu Baolu Cc: Joerg Roedel Cc: Will Deacon Cc: sta...@vger.kernel.org > --- v1->v2: * disable flush queue for all domains if caching-mode is on for any IOMMU (Lu). --- drivers/iommu/intel/iommu.c | 32 +++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 788119c5b021..de3dd617cf60 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -5373,6 +5373,36 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain, return ret; } +static bool domain_use_flush_queue(void) +{ + struct dmar_drhd_unit *drhd; + struct intel_iommu *iommu; + bool r = true; + + if (intel_iommu_strict) + return false; + + /* +* The flush queue implementation does not perform page-selective +* invalidations that are required for efficient TLB flushes in virtual +* environments. The benefit of batching is likely to be much lower than +* the overhead of synchronizing the virtual and physical IOMMU +* page-tables. +*/ + rcu_read_lock(); + for_each_active_iommu(iommu, drhd) { + if (!cap_caching_mode(iommu->cap)) + continue; + + pr_warn_once("IOMMU batching is disabled due to virtualization"); + r = false; + break; + } + rcu_read_unlock(); + + return r; +} + static int intel_iommu_domain_get_attr(struct iommu_domain *domain, enum iommu_attr attr, void *data) @@ -5383,7 +5413,7 @@ intel_iommu_domain_get_attr(struct iommu_domain *domain, case IOMMU_DOMAIN_DMA: switch (attr) { case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE: - *(int *)data = !intel_iommu_strict; + *(int *)data = domain_use_flush_queue(); return 0; default: return -ENODEV;
Re: [PATCH] iommu/vt-d: do not use flush-queue when caching-mode is on
On 1/27/21 8:26 AM, Lu Baolu wrote: +{ + struct dmar_domain *dmar_domain = to_dmar_domain(domain); + struct intel_iommu *iommu = domain_get_iommu(dmar_domain); + + if (intel_iommu_strict) + return 0; + + /* + * The flush queue implementation does not perform page-selective + * invalidations that are required for efficient TLB flushes in virtual + * environments. The benefit of batching is likely to be much lower than + * the overhead of synchronizing the virtual and physical IOMMU + * page-tables. + */ + if (iommu && cap_caching_mode(iommu->cap)) { + pr_warn_once("IOMMU batching is partially disabled due to virtualization"); + return 0; + } domain_get_iommu() only returns the first iommu, and could return NULL when this is called before domain attaching to any device. A better choice could be check caching mode globally and return false if caching mode is supported on any iommu. struct dmar_drhd_unit *drhd; struct intel_iommu *iommu; rcu_read_lock(); for_each_active_iommu(iommu, drhd) { if (cap_caching_mode(iommu->cap)) return false; We should unlock rcu before return here. Sorry! } rcu_read_unlock(); Best regards, baolu
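The self-correction above (an early `return false` inside the loop would leave the RCU read lock held) is the key detail, and the v2/v3 patch avoids it by recording the result and breaking so the single unlock after the loop always runs. A minimal user-space sketch of that corrected shape, with stub lock counters standing in for rcu_read_lock()/rcu_read_unlock() and a fake iommu array standing in for for_each_active_iommu():

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins so lock/unlock pairing can be checked; the real
 * code uses the kernel's rcu_read_lock()/rcu_read_unlock(). */
static int rcu_depth;
static void fake_rcu_read_lock(void) { rcu_depth++; }
static void fake_rcu_read_unlock(void) { rcu_depth--; }

struct fake_iommu { bool caching_mode; };

/* Mirrors the corrected v2/v3 shape: set the result and break, so the
 * single unlock after the loop always executes. Returning from inside
 * the loop (as in the first draft) would leak the read-side lock. */
static bool domain_use_flush_queue(const struct fake_iommu *iommus, int n)
{
	bool r = true;
	int i;

	fake_rcu_read_lock();
	for (i = 0; i < n; i++) {
		if (!iommus[i].caching_mode)
			continue;
		r = false;	/* caching mode on any iommu disables batching */
		break;
	}
	fake_rcu_read_unlock();
	return r;
}
```

The break-then-unlock shape also keeps the function with exactly one exit path, which is why the posted v2/v3 code reads `r = false; break;` rather than `return false;`.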
Re: [PATCH] iommu/vt-d: do not use flush-queue when caching-mode is on
Hi Nadav, On 1/27/21 4:38 AM, Nadav Amit wrote: From: Nadav Amit When an Intel IOMMU is virtualized, and a physical device is passed-through to the VM, changes of the virtual IOMMU need to be propagated to the physical IOMMU. The hypervisor therefore needs to monitor PTE mappings in the IOMMU page-tables. Intel specifications provide "caching-mode" capability that a virtual IOMMU uses to report that the IOMMU is virtualized and a TLB flush is needed after mapping to allow the hypervisor to propagate virtual IOMMU mappings to the physical IOMMU. To the best of my knowledge no real physical IOMMU reports "caching-mode" as turned on. Synchronizing the virtual and the physical TLBs is expensive if the hypervisor is unaware which PTEs have changed, as the hypervisor is required to walk all the virtualized tables and look for changes. Consequently, domain flushes are much more expensive than page-specific flushes on virtualized IOMMUs with passthrough devices. The kernel therefore exploited the "caching-mode" indication to avoid domain flushing and use page-specific flushing in virtualized environments. See commit 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching mode.") This behavior changed after commit 13cf01744608 ("iommu/vt-d: Make use of iova deferred flushing"). Now, when batched TLB flushing is used (the default), full TLB domain flushes are performed frequently, requiring the hypervisor to perform expensive synchronization between the virtual TLB and the physical one. Good catch. Thank you! Getting batched TLB flushes to use in such circumstances page-specific invalidations again is not easy, since the TLB invalidation scheme assumes that "full" domain TLB flushes are performed for scalability. Disable batched TLB flushes when caching-mode is on, as the performance benefit from using batched TLB invalidations is likely to be much smaller than the overhead of the virtual-to-physical IOMMU page-tables synchronization. 
Fixes: 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching mode.") Signed-off-by: Nadav Amit Cc: David Woodhouse Cc: Lu Baolu Cc: Joerg Roedel Cc: Will Deacon Cc: sta...@vger.kernel.org --- drivers/iommu/intel/iommu.c | 26 +- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 788119c5b021..4e08f5e17175 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -5373,6 +5373,30 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain, return ret; } +static int +intel_iommu_domain_get_attr_use_flush_queue(struct iommu_domain *domain) Just nit: Can we just use this static bool domain_use_flush_queue(struct iommu_domain *domain) ? +{ + struct dmar_domain *dmar_domain = to_dmar_domain(domain); + struct intel_iommu *iommu = domain_get_iommu(dmar_domain); + + if (intel_iommu_strict) + return 0; + + /* +* The flush queue implementation does not perform page-selective +* invalidations that are required for efficient TLB flushes in virtual +* environments. The benefit of batching is likely to be much lower than +* the overhead of synchronizing the virtual and physical IOMMU +* page-tables. +*/ + if (iommu && cap_caching_mode(iommu->cap)) { + pr_warn_once("IOMMU batching is partially disabled due to virtualization"); + return 0; + } domain_get_iommu() only returns the first iommu, and could return NULL when this is called before domain attaching to any device. A better choice could be check caching mode globally and return false if caching mode is supported on any iommu. 
struct dmar_drhd_unit *drhd; struct intel_iommu *iommu; rcu_read_lock(); for_each_active_iommu(iommu, drhd) { if (cap_caching_mode(iommu->cap)) return false; } rcu_read_unlock(); + + return 1; +} + static int intel_iommu_domain_get_attr(struct iommu_domain *domain, enum iommu_attr attr, void *data) @@ -5383,7 +5407,7 @@ intel_iommu_domain_get_attr(struct iommu_domain *domain, case IOMMU_DOMAIN_DMA: switch (attr) { case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE: - *(int *)data = !intel_iommu_strict; + *(int *)data = !intel_iommu_domain_get_attr_use_flush_queue(domain); return 0; default: return -ENODEV; Best regards, baolu
[PATCH v2 1/2] iommu/vt-d: Clear PRQ overflow only when PRQ is empty
It is incorrect to always clear PRO when it's set without first checking whether the overflow condition has been cleared. The current code assumes that if an overflow condition occurs it must have been cleared by the earlier loop. However, since the code runs in a threaded context, the overflow condition could occur even after setting the head to the tail under some extreme condition. To be sane, we should read both head/tail again when seeing a pending PRO and only clear PRO after all pending PRs have been handled. Suggested-by: Kevin Tian Link: https://lore.kernel.org/linux-iommu/mwhpr11mb18862d2ea5bd432bf22d99a48c...@mwhpr11mb1886.namprd11.prod.outlook.com/ Signed-off-by: Lu Baolu --- drivers/iommu/intel/svm.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index 033b25886e57..d7c98c5fa4e7 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -1042,8 +1042,17 @@ static irqreturn_t prq_event_thread(int irq, void *d) * Clear the page request overflow bit and wake up all threads that * are waiting for the completion of this handling. */ - if (readl(iommu->reg + DMAR_PRS_REG) & DMA_PRS_PRO) - writel(DMA_PRS_PRO, iommu->reg + DMAR_PRS_REG); + if (readl(iommu->reg + DMAR_PRS_REG) & DMA_PRS_PRO) { + pr_info_ratelimited("IOMMU: %s: PRQ overflow detected\n", + iommu->name); + head = dmar_readq(iommu->reg + DMAR_PQH_REG) & PRQ_RING_MASK; + tail = dmar_readq(iommu->reg + DMAR_PQT_REG) & PRQ_RING_MASK; + if (head == tail) { + writel(DMA_PRS_PRO, iommu->reg + DMAR_PRS_REG); + pr_info_ratelimited("IOMMU: %s: PRQ overflow cleared", + iommu->name); + } + } if (!completion_done(&iommu->prq_complete)) complete(&iommu->prq_complete); -- 2.25.1
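The essence of the fix above is an ordering rule: when a pending overflow (PRO) is observed, re-read the queue head/tail and clear PRO only once the ring is empty, i.e. every pending page request has been handled. A minimal user-space sketch of that decision against a simulated register state; the `prq_state` struct and helper name are illustrative, while the real code reads DMAR_PQH_REG/DMAR_PQT_REG and writes DMA_PRS_PRO:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated PRQ state; stands in for the DMAR_PQH_REG/DMAR_PQT_REG
 * head/tail registers and the PRO bit in DMAR_PRS_REG. */
struct prq_state {
	uint64_t head, tail;
	bool overflow;		/* PRO bit */
};

/* Mirrors the patch: on a pending overflow, re-read head/tail and clear
 * PRO only when the ring is empty. Returns true iff PRO was cleared. */
static bool maybe_clear_overflow(struct prq_state *s)
{
	if (!s->overflow)
		return false;		/* nothing pending */
	if (s->head != s->tail)
		return false;		/* PRs still queued: leave PRO set */
	s->overflow = false;		/* write DMA_PRS_PRO to clear */
	return true;
}
```

Clearing PRO while requests are still queued would tell the hardware the overflow was resolved before it actually was, which is exactly the race the threaded handler can hit.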
[PATCH v2 2/2] iommu/vt-d: Use INVALID response code instead of FAILURE
The VT-d IOMMU responds with RESPONSE_FAILURE for a page request in the below cases: - When it gets a Page_Request with no PASID; - When it gets a Page_Request with a PASID that is not in use for this device. This is allowed by the spec, but the IOMMU driver doesn't support such cases today. When the device receives RESPONSE_FAILURE, it puts the device state machine into HALT state. Now if we try to unload the driver, it hangs since the device doesn't send any outbound transactions to the host while the driver is trying to clear things up. The only possible responses would be for invalidation requests. Let's use RESPONSE_INVALID instead for now, so that the device state machine doesn't enter HALT state. Suggested-by: Ashok Raj Signed-off-by: Lu Baolu --- drivers/iommu/intel/svm.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index d7c98c5fa4e7..574a7e657a9a 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -911,10 +911,8 @@ static irqreturn_t prq_event_thread(int irq, void *d) u64 address; handled = 1; - req = &iommu->prq[head / sizeof(*req)]; - - result = QI_RESP_FAILURE; + result = QI_RESP_INVALID; address = (u64)req->addr << VTD_PAGE_SHIFT; if (!req->pasid_present) { pr_err("%s: Page request without PASID: %08llx %08llx\n", @@ -952,7 +950,6 @@ static irqreturn_t prq_event_thread(int irq, void *d) rcu_read_unlock(); } - result = QI_RESP_INVALID; /* Since we're using init_mm.pgd directly, we should never take * any faults on kernel addresses. */ if (!svm->mm) -- 2.25.1
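The behavioral change above reduces to a response-code policy: unrecoverable page requests (no PASID, or a PASID unknown to the device) get an INVALID response, which tells the device the fault cannot be serviced without halting its page-request state machine, while FAILURE is reserved and now avoided entirely. A simplified user-space sketch of that policy; the enum values and the `pick_response` helper are illustrative stand-ins for the driver's QI_RESP_* handling, not the actual prq_event_thread() logic:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical response codes standing in for QI_RESP_SUCCESS,
 * QI_RESP_INVALID and QI_RESP_FAILURE. */
enum prq_resp { RESP_SUCCESS, RESP_INVALID, RESP_FAILURE };

/* Per the patch: requests the driver cannot service (missing PASID, or
 * a PASID not in use for this device) get RESP_INVALID rather than
 * RESP_FAILURE, because a FAILURE response sends the device's
 * page-request state machine into HALT and can wedge driver unload. */
static enum prq_resp pick_response(bool pasid_present, bool pasid_known,
				   bool fault_serviced)
{
	if (!pasid_present || !pasid_known)
		return RESP_INVALID;	/* unrecoverable, but don't halt */
	return fault_serviced ? RESP_SUCCESS : RESP_INVALID;
}
```

With this policy RESP_FAILURE is never emitted, matching the patch's intent that the device keep issuing outbound transactions (e.g. invalidation completions) during teardown.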