[PATCH v1] arch/x86/kernel/pci-dma: Remove useless parameter of arch_dma_alloc_attrs
From: Huaisheng Ye

arch_dma_alloc_attrs has parameter gfp which is not used at all.
Remove it.

Signed-off-by: Huaisheng Ye
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Christoph Hellwig
Cc: Marek Szyprowski
Cc: Robin Murphy
Cc: Konrad Rzeszutek Wilk
Cc: Andrew Morton
Cc: Greg Kroah-Hartman
Cc: Tom Lendacky
Cc: Kate Stewart
Cc: Randy Dunlap
Cc: Michal Hocko
---
 arch/x86/include/asm/dma-mapping.h | 2 +-
 arch/x86/kernel/pci-dma.c          | 2 +-
 include/linux/dma-mapping.h        | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/dma-mapping.h b/arch/x86/include/asm/dma-mapping.h
index 89ce4bf..ef59747 100644
--- a/arch/x86/include/asm/dma-mapping.h
+++ b/arch/x86/include/asm/dma-mapping.h
@@ -33,7 +33,7 @@ static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
 int arch_dma_supported(struct device *dev, u64 mask);
 #define arch_dma_supported arch_dma_supported
 
-bool arch_dma_alloc_attrs(struct device **dev, gfp_t *gfp);
+bool arch_dma_alloc_attrs(struct device **dev);
 #define arch_dma_alloc_attrs arch_dma_alloc_attrs
 #endif
 
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 77625b6..94d1a49 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -76,7 +76,7 @@ void __init pci_iommu_alloc(void)
 	}
 }
 
-bool arch_dma_alloc_attrs(struct device **dev, gfp_t *gfp)
+bool arch_dma_alloc_attrs(struct device **dev)
 {
 	if (!*dev)
 		*dev = &x86_dma_fallback_dev;

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f8ab1c0..c80bb09 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -502,7 +502,7 @@ void *dma_common_pages_remap(struct page **pages, size_t size,
 #define dma_get_sgtable(d, t, v, h, s) dma_get_sgtable_attrs(d, t, v, h, s, 0)
 
 #ifndef arch_dma_alloc_attrs
-#define arch_dma_alloc_attrs(dev, flag)	(true)
+#define arch_dma_alloc_attrs(dev)	(true)
 #endif
 
 static inline void *dma_alloc_attrs(struct device *dev, size_t size,
@@ -521,7 +521,7 @@ static inline void *dma_alloc_attrs(struct device *dev, size_t size,
 	/* let the implementation decide on the zone to allocate from: */
 	flag &= ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM);
 
-	if (!arch_dma_alloc_attrs(&dev, &flag))
+	if (!arch_dma_alloc_attrs(&dev))
 		return NULL;
 	if (!ops->alloc)
 		return NULL;
-- 
1.8.3.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v6 05/17] media: rkisp1: add Rockchip ISP1 subdev driver
On Thu, May 24, 2018 at 8:30 PM Baruch Siach wrote:
> Hi Tomasz,
>
> On Mon, May 07, 2018 at 06:41:50AM +0000, Tomasz Figa wrote:
> > On Mon, May 7, 2018 at 3:38 PM Baruch Siach wrote:
> > > On Mon, May 07, 2018 at 06:13:27AM +0000, Tomasz Figa wrote:
> > > > On Thu, May 3, 2018 at 6:09 PM Baruch Siach wrote:
> > > > > On Thu, Mar 08, 2018 at 05:47:55PM +0800, Jacob Chen wrote:
> > > > > > +static int rkisp1_isp_sd_s_power(struct v4l2_subdev *sd, int on)
> > > > > > +{
> > > > > > +	struct rkisp1_device *isp_dev = sd_to_isp_dev(sd);
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	v4l2_dbg(1, rkisp1_debug, &isp_dev->v4l2_dev, "s_power: %d\n", on);
> > > > > > +
> > > > > > +	if (on) {
> > > > > > +		ret = pm_runtime_get_sync(isp_dev->dev);
> > > > > > +		if (ret < 0)
> > > > > > +			return ret;
> > > > > > +
> > > > > > +		rkisp1_config_clk(isp_dev);
> > > > > > +	} else {
> > > > > > +		ret = pm_runtime_put(isp_dev->dev);
> > > > >
> > > > > I commented this line out to make more than one STREAMON work.
> > > > > Otherwise, the second STREAMON hangs. I guess the bug is not in this
> > > > > driver. Probably something in drivers/soc/rockchip/pm_domains.c.
> > > > > Just noting that in case you or someone on Cc would like to
> > > > > investigate it further.
> > > > >
> > > > > I tested v4.16-rc4 on the Tinkerboard.
> > > >
> > > > Looks like that version doesn't include the IOMMU PM and clock
> > > > handling rework [1], which should fix a lot of runtime PM issues.
> > > > FWIW, linux-next seems to already include it.
> > > >
> > > > [1] https://lkml.org/lkml/2018/3/23/44
> > >
> > > Thanks for the reference.
> > >
> > > It looks like the iommu driver part is in Linus' tree already. The DT
> > > part is in the v4.18-armsoc/dts32 branch of Heiko's tree. Am I missing
> > > anything?
> >
> > You're right, most of the series made it in time for 4.17. However, the
> > DT part (precisely, the clocks properties added to IOMMU nodes) is
> > crucial for the fixes to be effective.
> >
> > > Anyway, I'll take a look.
> >
> > Thanks for testing. :)
>
> (Forgot to mention in my previous email...)
>
> I finally got around to testing. Unfortunately, kernel v4.17-rc6 with a
> cherry-pick of commit c78751f91c0b (ARM: dts: rockchip: add clocks in iommu
> nodes) from Heiko's tree still exhibits the same problem. STREAMON hangs on
> the second try. The same workaround "fixes" it.

Thanks for testing. I'm out of ideas, since the same code seems to work
fine for us in the Chrome OS 4.4 kernel. Maybe we could have someone from
RK take a look.

Best regards,
Tomasz
Re: [PATCH] drm/rockchip: vop: fix irq disabled after vop driver probed
Hi Heiko, Sandy,

On Fri, May 25, 2018 at 7:07 AM Heiko Stübner wrote:
> From: Sandy Huang
>
> The vop irq is shared between vop and iommu and irq probing in the
> iommu driver moved to the probe function recently. This can in some
> cases lead to a stall if the irq is triggered while the vop driver
> still has it disabled.
>
> But there is no real need to disable the irq, as the vop can simply
> also track its enabled state and ignore irqs in that case.
>
> So remove the enable/disable handling and add appropriate condition
> to the irq handler.
>
> Signed-off-by: Sandy Huang
> [added an actual commit message]
> Signed-off-by: Heiko Stuebner
> ---
> Hi Ezequiel,
>
> this patch came from a discussion I had with Rockchip people over the
> iommu changes and resulting issues back in april, but somehow was
> forgotten and not posted to the lists. Correcting that now.
>
> So removing the enable/disable voodoo on the shared interrupt is
> the preferred way.
>
>  drivers/gpu/drm/rockchip/rockchip_drm_vop.c | 13 ++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
> index 510cdf0..61493d4 100644
> --- a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
> +++ b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
> @@ -549,8 +549,6 @@ static int vop_enable(struct drm_crtc *crtc)
>
>  	spin_unlock(&vop->reg_lock);
>
> -	enable_irq(vop->irq);
> -

While this one should be okay (+/- my comment for vop_isr()), because
the hardware is already powered on and clocked at this point...

>  	drm_crtc_vblank_on(crtc);
>
>  	return 0;
> @@ -596,8 +594,6 @@ static void vop_crtc_atomic_disable(struct drm_crtc *crtc,
>
>  	vop_dsp_hold_valid_irq_disable(vop);
>
> -	disable_irq(vop->irq);
> -
>  	vop->is_enabled = false;

...this one is more tricky. There might be an interrupt handler still
running at this point. disable_irq() waits for any running handler to
complete before disabling, so we might want to call synchronize_irq()
after setting is_enabled to false.

>  	/*
> @@ -1168,6 +1164,13 @@ static irqreturn_t vop_isr(int irq, void *data)
>  	int ret = IRQ_NONE;
>
>  	/*
> +	 * since the irq is shared with iommu, iommu irq is enabled before vop
> +	 * enable, so before vop enable we do nothing.
> +	 */
> +	if (!vop->is_enabled)
> +		return IRQ_NONE;

This doesn't seem to be properly synchronized. We don't even call it
under a spinlock, so no barriers are issued. Perhaps we should make this
atomic_t?

Best regards,
Tomasz
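The suggestion above (an atomically published enabled flag, paired with synchronize_irq() in the disable path) can be sketched in userspace C. This is a loose illustration, not the rockchip_drm_vop.c code: the `fake_vop` names are invented, and the real synchronize_irq() call is only indicated in a comment since it has no userspace equivalent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Return values mirroring the kernel's irqreturn_t for the sketch. */
enum { IRQ_NONE = 0, IRQ_HANDLED = 1 };

struct fake_vop {
	atomic_bool is_enabled;	/* published with release, read with acquire */
	int handled_count;
};

int fake_vop_isr(struct fake_vop *vop)
{
	/* Shared IRQ: bail out unless this device is enabled. The acquire
	 * load pairs with the release stores in enable/disable below. */
	if (!atomic_load_explicit(&vop->is_enabled, memory_order_acquire))
		return IRQ_NONE;
	vop->handled_count++;
	return IRQ_HANDLED;
}

void fake_vop_enable(struct fake_vop *vop)
{
	atomic_store_explicit(&vop->is_enabled, true, memory_order_release);
}

void fake_vop_disable(struct fake_vop *vop)
{
	atomic_store_explicit(&vop->is_enabled, false, memory_order_release);
	/* In the kernel one would call synchronize_irq(vop->irq) here, so
	 * that any handler which saw is_enabled == true has finished before
	 * the device is powered down. */
}
```

The point of the sketch is only the ordering: the flag is cleared before waiting out in-flight handlers, so no handler can start touching powered-down hardware afterwards.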
Re: [PATCH v2 13/40] vfio: Add support for Shared Virtual Addressing
Hi,

On 2018/5/24 23:04, Jean-Philippe Brucker wrote:
> On 24/05/18 13:35, Xu Zaibo wrote:
>>> Right, sva_init() must be called once for any device that intends to
>>> use bind(). For the second process though, group->sva_enabled will be
>>> true so we won't call sva_init() again, only bind().
>>
>> Well, while I create mediated devices based on one parent device to
>> support multiple processes (a new process will create a new 'vfio_group'
>> for the corresponding mediated device, and 'sva_enabled' cannot work any
>> more), in fact, *sva_init and *sva_shutdown are basically working on the
>> parent device, so, as a result, I only need sva initiation and shutdown
>> on the parent device once. So I change the two as following:
>>
>> @@ -551,8 +565,18 @@ int iommu_sva_device_init(struct device *dev, unsigned long features,
>>  	if (features & ~IOMMU_SVA_FEAT_IOPF)
>>  		return -EINVAL;
>>
>> +	/* If already exists, do nothing */
>> +	mutex_lock(&dev->iommu_param->lock);
>> +	if (dev->iommu_param->sva_param) {
>> +		mutex_unlock(&dev->iommu_param->lock);
>> +		return 0;
>> +	}
>> +	mutex_unlock(&dev->iommu_param->lock);
>>
>>  	if (features & IOMMU_SVA_FEAT_IOPF) {
>>  		ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf,
>>
>> @@ -621,6 +646,14 @@ int iommu_sva_device_shutdown(struct device *dev)
>>  	if (!domain)
>>  		return -ENODEV;
>>
>> +	/* If any other process is working on the device, shutdown does nothing. */
>> +	mutex_lock(&dev->iommu_param->lock);
>> +	if (!list_empty(&dev->iommu_param->sva_param->mm_list)) {
>> +		mutex_unlock(&dev->iommu_param->lock);
>> +		return 0;
>> +	}
>> +	mutex_unlock(&dev->iommu_param->lock);
>
> I don't think iommu-sva.c is the best place for this, it's probably
> better to implement an intermediate layer (the mediating driver), that
> calls iommu_sva_device_init() and iommu_sva_device_shutdown() once. Then
> vfio-pci would still call these functions itself, but for mdev the
> mediating driver keeps a refcount of groups, and calls device_shutdown()
> only when freeing the last mdev.
>
> A device driver (non mdev in this example) expects to be able to free
> all its resources after sva_device_shutdown() returns. Imagine the
> mm_list isn't empty (mm_exit() is running late), and instead of waiting
> in unbind_dev_all() below, we return 0 immediately. Then the calling
> driver frees its resources, and the mm_exit callback along with private
> data passed to bind() disappear. If a mm_exit() is still running in
> parallel, then it will try to access freed data and corrupt memory. So
> in this function if mm_list isn't empty, the only thing we can do is
> wait.

I still don't understand why we should 'unbind_dev_all'; is it possible
to do an 'unbind_dev_pasid'? Then we could do other things instead of
waiting, which the user may not like. :)

Thanks,
Zaibo

>> +	__iommu_sva_unbind_dev_all(dev);
>>
>>  	mutex_lock(&dev->iommu_param->lock);
>>
>> I add the above two checkings in both *sva_init and *sva_shutdown; it is
>> working now, but I don't know if it will cause any new problems. What's
>> more, I doubt if it is reasonable to check this in VFIO to avoid the
>> repeated operation; maybe it is better to check in IOMMU. :)
Re: [PATCH v7 2/2] iommu/amd: Add basic debugfs infrastructure for AMD IOMMU
On 05/18/2018 04:02 PM, Gary R Hook wrote:
> On 05/18/2018 11:49 AM, Randy Dunlap wrote:
>> I think the Kconfig option would have been the correct choice.
>
> "Preferred", perhaps. Neither is incorrect. And really, the
> Makefile/Kconfig choice is somewhat separate from the organization
> issue.
>
> So I've made the changes for this. Now I'm waiting on Joerg to make a
> decision on the code/file organization. I still prefer a separate file
> for the debugfs code.

Joerg: *poke* Any thoughts on this?
[PATCH] drm/rockchip: vop: fix irq disabled after vop driver probed
From: Sandy Huang

The vop irq is shared between vop and iommu and irq probing in the
iommu driver moved to the probe function recently. This can in some
cases lead to a stall if the irq is triggered while the vop driver
still has it disabled.

But there is no real need to disable the irq, as the vop can simply
also track its enabled state and ignore irqs in that case.

So remove the enable/disable handling and add appropriate condition
to the irq handler.

Signed-off-by: Sandy Huang
[added an actual commit message]
Signed-off-by: Heiko Stuebner
---
Hi Ezequiel,

this patch came from a discussion I had with Rockchip people over the
iommu changes and resulting issues back in april, but somehow was
forgotten and not posted to the lists. Correcting that now.

So removing the enable/disable voodoo on the shared interrupt is
the preferred way.

 drivers/gpu/drm/rockchip/rockchip_drm_vop.c | 13 ++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
index 510cdf0..61493d4 100644
--- a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
+++ b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
@@ -549,8 +549,6 @@ static int vop_enable(struct drm_crtc *crtc)
 
 	spin_unlock(&vop->reg_lock);
 
-	enable_irq(vop->irq);
-
 	drm_crtc_vblank_on(crtc);
 
 	return 0;
@@ -596,8 +594,6 @@ static void vop_crtc_atomic_disable(struct drm_crtc *crtc,
 
 	vop_dsp_hold_valid_irq_disable(vop);
 
-	disable_irq(vop->irq);
-
 	vop->is_enabled = false;
 
 	/*
@@ -1168,6 +1164,13 @@ static irqreturn_t vop_isr(int irq, void *data)
 	int ret = IRQ_NONE;
 
 	/*
+	 * since the irq is shared with iommu, iommu irq is enabled before vop
+	 * enable, so before vop enable we do nothing.
+	 */
+	if (!vop->is_enabled)
+		return IRQ_NONE;
+
+	/*
 	 * interrupt register has interrupt status, enable and clear bits, we
 	 * must hold irq_lock to avoid a race with enable/disable_vblank().
 	 */
@@ -1586,9 +1588,6 @@ static int vop_bind(struct device *dev, struct device *master, void *data)
 	if (ret)
 		goto err_disable_pm_runtime;
 
-	/* IRQ is initially disabled; it gets enabled in power_on */
-	disable_irq(vop->irq);
-
 	return 0;
 
 err_disable_pm_runtime:
-- 
2.7.4
Re: [PATCH v6 0/7] vfio/type1: Add support for valid iova list management
[Cc +Joerg: AMD-Vi observation towards the end]

On Wed, 18 Apr 2018 12:40:38 +0100
Shameer Kolothum wrote:
> This series introduces an iova list associated with a vfio
> iommu. The list is kept updated taking care of iommu apertures,
> and reserved regions. Also this series adds checks for any conflict
> with existing dma mappings whenever a new device group is attached to
> the domain.
>
> User-space can retrieve valid iova ranges using VFIO_IOMMU_GET_INFO
> ioctl capability chains. Any dma map request outside the valid iova
> range will be rejected.

Hi Shameer,

I ran into two minor issues in testing this series, both related to
mdev usage of type1. First, in patch 5/7 when we try to validate a dma
map request:

> +static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
> +				      dma_addr_t start, dma_addr_t end)
> +{
> +	struct list_head *iova = &iommu->iova_list;
> +	struct vfio_iova *node;
> +
> +	list_for_each_entry(node, iova, list) {
> +		if ((start >= node->start) && (end <= node->end))
> +			return true;
> +	}
> +
> +	return false;
> +}

A container with only an mdev device will have an empty list because it
has no backing iommu to set ranges or reserved regions, so any dma map
will fail. I think this is resolved as follows:

--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1100,7 +1100,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
 			return true;
 	}
 
-	return false;
+	return list_empty(&iommu->iova_list);
 }

ie. return false only if there was anything to test against.

The second issue is similar, patch 6/7 adds to VFIO_IOMMU_GET_INFO:

+	ret = vfio_iommu_iova_build_caps(iommu, &caps);
+	if (ret)
+		return ret;

And build_caps has:

+	list_for_each_entry(iova, &iommu->iova_list, list)
+		iovas++;
+
+	if (!iovas) {
+		ret = -EINVAL;

Therefore if the iova list is empty, as for mdevs, the user can no
longer even call VFIO_IOMMU_GET_INFO on the container, which is a
regression. Again, I think the fix is simple:

@@ -2090,7 +2090,7 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
 		iovas++;
 
 	if (!iovas) {
-		ret = -EINVAL;
+		ret = 0;
 		goto out_unlock;
 	}

ie. build_caps needs to handle lack of an iova_list as a non-error.

Also, I wrote a small unit test to validate the iova list for my
systems[1]. With the above changes, my Intel test system gives expected
results:

# ./vfio-type1-iova-list /sys/bus/mdev/devices/c08db5ed-05d3-4b39-b150-438a18bc698f /sys/bus/pci/devices/:00:1b.0
Adding device: c08db5ed-05d3-4b39-b150-438a18bc698f
Initial info struct size: 0x18
No caps
Adding device: :00:1b.0
Initial info struct size: 0x18
Requested info struct size: 0x48
New info struct size: 0x48
argsz: 0x48, flags: 0x3, cap_offset: 0x18
00: 4800 0300 00f0
10: 1800 0100 0100
20: 0200
30: dffe f0fe
40: ff01
[cap id: 1, version: 1, next: 0x0]
Found type1 iova range version: 1
00: - fedf
01: fef0 - 01ff

Adding an mdev device to the container results in no iova list, adding
the physical device updates to the expected set with the MSI range
excluded.

I was a little surprised by an AMD system:

# ./vfio-type1-iova-list /sys/bus/pci/devices/:01:00.0
Adding device: :01:00.0
Initial info struct size: 0x18
Requested info struct size: 0x58
New info struct size: 0x58
argsz: 0x58, flags: 0x3, cap_offset: 0x18
00: 5800 0300 00f0 7fff
10: 1800 0100 0100
20: 0300
30: dffe f0fe
40: fc00 0001
50:
[cap id: 1, version: 1, next: 0x0]
Found type1 iova range version: 1
00: - fedf
01: fef0 - 00fc
02: 0100 -

The additional excluded range is the HyperTransport area. While for
99.9+% of use cases that isn't going to be a problem, it imposes a
rather surprising limit on the size of a VM we can use on AMD: just
under 1TB. I fully expect we'll see a bug report from that and we
should be thinking about how to address it.

Otherwise, if the changes above look good to you I'll incorporate them
into their respective patches and push to my next branch.

Thanks,
Alex

[1] https://gist.github.com/awilliam/25e39252ab28101a2d89ace1ff765fdd
[PATCH v2 7/8] iommu: Remove IOMMU_OF_DECLARE
Now that we use the driver core to stop deferred probe for missing
drivers, IOMMU_OF_DECLARE can be removed.

This is slightly less optimal than having a list of built-in drivers in
that we'll now defer probe twice before giving up. This shouldn't have a
significant impact on boot times, as past discussions about deferred
probe have given no evidence of deferred probe having a substantial
impact.

Cc: Will Deacon
Cc: Robin Murphy
Cc: Joerg Roedel
Cc: Marek Szyprowski
Cc: Kukjin Kim
Cc: Krzysztof Kozlowski
Cc: Rob Clark
Cc: Heiko Stuebner
Cc: Frank Rowand
Cc: linux-arm-ker...@lists.infradead.org
Cc: iommu@lists.linux-foundation.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-arm-...@vger.kernel.org
Cc: linux-rockc...@lists.infradead.org
Cc: devicet...@vger.kernel.org
Signed-off-by: Rob Herring
---
 drivers/iommu/arm-smmu-v3.c    |  2 --
 drivers/iommu/arm-smmu.c       |  7 ---
 drivers/iommu/exynos-iommu.c   |  2 --
 drivers/iommu/ipmmu-vmsa.c     |  3 ---
 drivers/iommu/msm_iommu.c      |  2 --
 drivers/iommu/of_iommu.c       | 19 +--
 drivers/iommu/qcom_iommu.c     |  2 --
 drivers/iommu/rockchip-iommu.c |  2 --
 include/linux/of_iommu.h       |  4
 9 files changed, 1 insertion(+), 42 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 1d647104bccc..22bdabd3d8e0 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2915,8 +2915,6 @@ static struct platform_driver arm_smmu_driver = {
 };
 module_platform_driver(arm_smmu_driver);
 
-IOMMU_OF_DECLARE(arm_smmuv3, "arm,smmu-v3");
-
 MODULE_DESCRIPTION("IOMMU API for ARM architected SMMUv3 implementations");
 MODULE_AUTHOR("Will Deacon ");
 MODULE_LICENSE("GPL v2");
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 69e7c60792a8..9dd7cbaa3b0c 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -2211,13 +2211,6 @@ static struct platform_driver arm_smmu_driver = {
 };
 module_platform_driver(arm_smmu_driver);
 
-IOMMU_OF_DECLARE(arm_smmuv1, "arm,smmu-v1");
-IOMMU_OF_DECLARE(arm_smmuv2, "arm,smmu-v2");
-IOMMU_OF_DECLARE(arm_mmu400, "arm,mmu-400");
-IOMMU_OF_DECLARE(arm_mmu401, "arm,mmu-401");
-IOMMU_OF_DECLARE(arm_mmu500, "arm,mmu-500");
-IOMMU_OF_DECLARE(cavium_smmuv2, "cavium,smmu-v2");
-
 MODULE_DESCRIPTION("IOMMU API for ARM architected SMMU implementations");
 MODULE_AUTHOR("Will Deacon ");
 MODULE_LICENSE("GPL v2");
diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 85879cfec52f..b128cb4372d3 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -1390,5 +1390,3 @@ static int __init exynos_iommu_init(void)
 	return ret;
 }
 core_initcall(exynos_iommu_init);
-
-IOMMU_OF_DECLARE(exynos_iommu_of, "samsung,exynos-sysmmu");
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 40ae6e87cb88..f026aa16d5f1 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -1108,9 +1108,6 @@ static void __exit ipmmu_exit(void)
 subsys_initcall(ipmmu_init);
 module_exit(ipmmu_exit);
 
-IOMMU_OF_DECLARE(ipmmu_vmsa_iommu_of, "renesas,ipmmu-vmsa");
-IOMMU_OF_DECLARE(ipmmu_r8a7795_iommu_of, "renesas,ipmmu-r8a7795");
-
 MODULE_DESCRIPTION("IOMMU API for Renesas VMSA-compatible IPMMU");
 MODULE_AUTHOR("Laurent Pinchart ");
 MODULE_LICENSE("GPL v2");
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 0d3350463a3f..27377742600d 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -877,7 +877,5 @@ static void __exit msm_iommu_driver_exit(void)
 subsys_initcall(msm_iommu_driver_init);
 module_exit(msm_iommu_driver_exit);
 
-IOMMU_OF_DECLARE(msm_iommu_of, "qcom,apq8064-iommu");
-
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Stepan Moskovchenko ");
diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index 2aac8387717c..1904ccf9fc4e 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -27,9 +27,6 @@
 
 #define NO_IOMMU	1
 
-static const struct of_device_id __iommu_of_table_sentinel
-	__used __section(__iommu_of_table_end);
-
 /**
  * of_get_dma_window - Parse *dma-window property and returns 0 if found.
  *
@@ -98,19 +95,6 @@ int of_get_dma_window(struct device_node *dn, const char *prefix, int index,
 }
 EXPORT_SYMBOL_GPL(of_get_dma_window);
 
-static bool of_iommu_driver_present(struct device_node *np)
-{
-	/*
-	 * If the IOMMU still isn't ready by the time we reach init, assume
-	 * it never will be. We don't want to defer indefinitely, nor attempt
-	 * to dereference __iommu_of_table after it's been freed.
-	 */
-	if (system_state >=
[PATCH v2 6/8] iommu: Stop deferring probe at end of initcalls
The IOMMU subsystem has its own mechanism to not defer probe if driver
support is missing. Now that the driver core supports stopping deferred
probe if drivers aren't built-in (and probed), use the driver core
support so the IOMMU-specific support can be removed.

Cc: Joerg Roedel
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Rob Herring
---
 drivers/iommu/of_iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index 5c36a8b7656a..2aac8387717c 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -133,7 +133,7 @@ static int of_iommu_xlate(struct device *dev,
 	 * a proper probe-ordering dependency mechanism in future.
 	 */
 	if (!ops)
-		return -EPROBE_DEFER;
+		return driver_deferred_probe_check_init_done(dev, true);
 
 	return ops->of_xlate(dev, iommu_spec);
 }
-- 
2.17.0
[RFC PATCH v3 6/9] mm/vmpressure: update usage of zone modifiers
From: Huaisheng Ye

Use __GFP_ZONE_MOVABLE to replace (__GFP_HIGHMEM | __GFP_MOVABLE).

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from GFP
bitmasks; the bottom three bits of the GFP mask are reserved for storing
the encoded zone number.

__GFP_ZONE_MOVABLE contains the encoded ZONE_MOVABLE and the
__GFP_MOVABLE flag. With GFP_ZONE_TABLE, __GFP_HIGHMEM ORed with
__GFP_MOVABLE means gfp_zone should return ZONE_MOVABLE. In order to
keep that compatible with GFP_ZONE_TABLE, replace (__GFP_HIGHMEM |
__GFP_MOVABLE) with __GFP_ZONE_MOVABLE.

Signed-off-by: Huaisheng Ye
Cc: Andrew Morton
Cc: zhongjiang
Cc: Minchan Kim
Cc: Dan Carpenter
Cc: David Rientjes
Cc: Christoph Hellwig
---
 mm/vmpressure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 85350ce..30a40e2 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -256,7 +256,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
 	 * Indirect reclaim (kswapd) sets sc->gfp_mask to GFP_KERNEL, so
 	 * we account it too.
 	 */
-	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
+	if (!(gfp & (__GFP_ZONE_MOVABLE | __GFP_IO | __GFP_FS)))
 		return;
 
 	/*
-- 
1.8.3.1
Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD
On Thu 24-05-18 08:18:18, Matthew Wilcox wrote:
> On Thu, May 24, 2018 at 02:23:23PM +0200, Michal Hocko wrote:
> > > If we had eight ZONEs, we could offer:
> >
> > No, please no more zones. What we have is quite a maint. burden on its
> > own. Ideally we should only have lowmem, highmem and special/device
> > zones for directly kernel accessible memory, the one that the kernel
> > cannot or must not use and completely special memory managed out of
> > the page allocator. All the remaining constrains should better be
> > implemented on top.
>
> I believe you when you say that they're a maintenance pain. Is that
> maintenance pain because they're so specialised?

Well, it used to be LRU balancing, which is gone with the node reclaim,
but that brings new challenges. Now, as you say, their meaning is not
really clear to users and that leads to bugs left and right.

> ie if we had more, could we solve our pain by making them more generic?

Well, if you have more you will consume more bits in the struct pages,
right?

[...]

> > But those already do have a proper API, IIUC. So do we really need to
> > make our GFP_*/Zone API more complicated than it already is?
>
> I don't want to change the driver API (setting the DMA mask, etc),
> but we don't actually have a good API to the page allocator for the
> implementation of dma_alloc_foo() to request pages. More or less,
> architectures do:
>
> 	if (mask < 4GB)
> 		alloc_page(GFP_DMA)
> 	else if (mask < 64EB)
> 		alloc_page(GFP_DMA32)
> 	else
> 		alloc_page(GFP_HIGHMEM)
>
> it more-or-less sucks that the devices with 28-bit DMA limits are forced
> to allocate from the low 16MB when they're perfectly capable of using the
> low 256MB.

Do we actually care all that much about those? If yes then we should
probably follow the ZONE_DMA (x86) path and use a CMA region for them.
I mean most devices should be good with very limited addressability or
below 4G, no?

-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD
On Thu, May 24, 2018 at 02:23:23PM +0200, Michal Hocko wrote:
> > If we had eight ZONEs, we could offer:
>
> No, please no more zones. What we have is quite a maint. burden on its
> own. Ideally we should only have lowmem, highmem and special/device
> zones for directly kernel accessible memory, the one that the kernel
> cannot or must not use and completely special memory managed out of
> the page allocator. All the remaining constrains should better be
> implemented on top.

I believe you when you say that they're a maintenance pain. Is that
maintenance pain because they're so specialised? ie if we had more,
could we solve our pain by making them more generic?

> > ZONE_16M	// 24 bit
> > ZONE_256M	// 28 bit
> > ZONE_LOWMEM	// CONFIG_32BIT only
> > ZONE_4G	// 32 bit
> > ZONE_64G	// 36 bit
> > ZONE_1T	// 40 bit
> > ZONE_ALL	// everything larger
> > ZONE_MOVABLE	// movable allocations; no physical address guarantees
> >
> > #ifdef CONFIG_64BIT
> > #define ZONE_NORMAL	ZONE_ALL
> > #else
> > #define ZONE_NORMAL	ZONE_LOWMEM
> > #endif
> >
> > This would cover most driver DMA mask allocations; we could tweak the
> > offered zones based on analysis of what people need.
>
> But those already do have a proper API, IIUC. So do we really need to
> make our GFP_*/Zone API more complicated than it already is?

I don't want to change the driver API (setting the DMA mask, etc),
but we don't actually have a good API to the page allocator for the
implementation of dma_alloc_foo() to request pages. More or less,
architectures do:

	if (mask < 4GB)
		alloc_page(GFP_DMA)
	else if (mask < 64EB)
		alloc_page(GFP_DMA32)
	else
		alloc_page(GFP_HIGHMEM)

it more-or-less sucks that the devices with 28-bit DMA limits are forced
to allocate from the low 16MB when they're perfectly capable of using the
low 256MB. Sure, my proposal doesn't help 27 or 26 bit DMA mask devices,
but those are pretty rare.

I'm sure you don't need reminding what a mess vmalloc_32 is, and the
implementation of saa7146_vmalloc_build_pgtable() just hurts.

> > #define GFP_HIGHUSER		(GFP_USER | ZONE_ALL)
> > #define GFP_HIGHUSER_MOVABLE	(GFP_USER | ZONE_MOVABLE)
> >
> > One other thing I want to see is that fallback from zones happens from
> > highest to lowest normally (ie if you fail to allocate in 1T, then you
> > try to allocate from 64G), but movable allocations happen from lowest
> > to highest. So ZONE_16M ends up full of page cache pages which are
> > readily evictable for the rare occasions when we need to allocate memory
> > below 16MB.
> >
> > I'm sure there are lots of good reasons why this won't work, which is
> > why I've been hesitant to propose it before now.
>
> I am worried you are playing with a can of worms...

Yes. Me too.
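The coarse mask-to-zone selection described above can be illustrated with a small userspace sketch. Everything here is illustrative: the `fake_zone` names and bucket boundaries are assumptions loosely modeled on x86's ZONE_DMA (16MB) and ZONE_DMA32 (4GB), not the kernel's actual gfp selection code. The key property is that a zone is only safe if the device can address all of it, which is why a 28-bit device falls through to the 16MB bucket.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical zone buckets, tagged with their addressability in bits. */
enum fake_zone {
	FAKE_ZONE_DMA = 24,	/* low 16MB */
	FAKE_ZONE_DMA32 = 32,	/* low 4GB */
	FAKE_ZONE_NORMAL = 64,	/* everything */
};

enum fake_zone zone_for_dma_mask(uint64_t mask)
{
	/* A zone is usable only if the device can address ALL of it, so a
	 * 28-bit device (mask 0x0fffffff) cannot safely use the 4GB bucket
	 * and ends up squeezed into the 16MB one, even though it could use
	 * the low 256MB: Matthew's complaint. */
	if (mask >= UINT64_MAX)
		return FAKE_ZONE_NORMAL;
	if (mask >= UINT32_MAX)
		return FAKE_ZONE_DMA32;
	return FAKE_ZONE_DMA;
}
```

With finer-grained buckets, as in the eight-zone proposal quoted above, the 28-bit device would land in a hypothetical ZONE_256M instead.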
Re: [PATCH v2 13/40] vfio: Add support for Shared Virtual Addressing
On 24/05/18 13:35, Xu Zaibo wrote:
>> Right, sva_init() must be called once for any device that intends to use
>> bind(). For the second process though, group->sva_enabled will be true
>> so we won't call sva_init() again, only bind().
>
> Well, while I create mediated devices based on one parent device to support
> multiple processes (a new process will create a new 'vfio_group' for the
> corresponding mediated device, and 'sva_enabled' cannot work any more), in
> fact, *sva_init and *sva_shutdown are basically working on the parent
> device, so, as a result, I just only need sva initiation and shutdown on the
> parent device only once. So I change the two as following:
>
> @@ -551,8 +565,18 @@ int iommu_sva_device_init(struct device *dev, unsigned long features,
>  	if (features & ~IOMMU_SVA_FEAT_IOPF)
>  		return -EINVAL;
>
> +	/* If already exists, do nothing */
> +	mutex_lock(&dev->iommu_param->lock);
> +	if (dev->iommu_param->sva_param) {
> +		mutex_unlock(&dev->iommu_param->lock);
> +		return 0;
> +	}
> +	mutex_unlock(&dev->iommu_param->lock);
>
>  	if (features & IOMMU_SVA_FEAT_IOPF) {
>  		ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf,
>
> @@ -621,6 +646,14 @@ int iommu_sva_device_shutdown(struct device *dev)
>  	if (!domain)
>  		return -ENODEV;
>
> +	/* If any other process is working on the device, shut down does nothing. */
> +	mutex_lock(&dev->iommu_param->lock);
> +	if (!list_empty(&dev->iommu_param->sva_param->mm_list)) {
> +		mutex_unlock(&dev->iommu_param->lock);
> +		return 0;
> +	}
> +	mutex_unlock(&dev->iommu_param->lock);

I don't think iommu-sva.c is the best place for this, it's probably
better to implement an intermediate layer (the mediating driver), that
calls iommu_sva_device_init() and iommu_sva_device_shutdown() once. Then
vfio-pci would still call these functions itself, but for mdev the
mediating driver keeps a refcount of groups, and calls device_shutdown()
only when freeing the last mdev.

A device driver (non mdev in this example) expects to be able to free
all its resources after sva_device_shutdown() returns. Imagine the
mm_list isn't empty (mm_exit() is running late), and instead of waiting
in unbind_dev_all() below, we return 0 immediately. Then the calling
driver frees its resources, and the mm_exit callback along with private
data passed to bind() disappear. If a mm_exit() is still running in
parallel, then it will try to access freed data and corrupt memory. So
in this function if mm_list isn't empty, the only thing we can do is
wait.

Thanks,
Jean

> +	__iommu_sva_unbind_dev_all(dev);
>
>  	mutex_lock(&dev->iommu_param->lock);
>
> I add the above two checkings in both *sva_init and *sva_shutdown, it is
> working now, but i don't know if it will cause any new problems. What's
> more, i doubt if it is reasonable to check this to avoid repeating
> operation in VFIO, maybe it is better to check in IOMMU. :)
Re: [PATCH v2 03/40] iommu/sva: Manage process address spaces
On 24/05/18 12:50, Ilias Apalodimas wrote:
>> Interesting, I hadn't thought about this use-case before. At first I
>> thought you were talking about mdev devices assigned to VMs, but I think
>> you're referring to mdevs assigned to userspace drivers instead? Out of
>> curiosity, is it only theoretical or does someone actually need this?
>
> There have been some non-upstreamed efforts to have mdev produce
> userspace drivers. Huawei is using it on what they call "wrapdrive" for
> crypto devices and we did a proof of concept for ethernet interfaces. At
> the time we chose not to involve the IOMMU for the reason you mentioned,
> but having it there would be good.

I'm guessing there were good reasons to do it that way but I wonder, is
it not simpler to just have the kernel driver create a /dev/foo, with a
standard ioctl/mmap/poll interface? Here VFIO adds a layer of
indirection, and since the mediating driver has to implement these
operations already, what is gained?

Thanks,
Jean
Re: [PATCH v2 03/40] iommu/sva: Manage process address spaces
> Interesting, I hadn't thought about this use-case before. At first I
> thought you were talking about mdev devices assigned to VMs, but I think
> you're referring to mdevs assigned to userspace drivers instead? Out of
> curiosity, is it only theoretical or does someone actually need this?

There have been some non-upstreamed efforts to have mdev produce
userspace drivers. Huawei is using it on what they call "wrapdrive" for
crypto devices and we did a proof of concept for ethernet interfaces. At
the time we chose not to involve the IOMMU for the reason you mentioned,
but having it there would be good.

Thanks
Ilias
Re: [PATCH v6 05/17] media: rkisp1: add Rockchip ISP1 subdev driver
Hi Tomasz,

On Mon, May 07, 2018 at 06:41:50AM +, Tomasz Figa wrote:
> On Mon, May 7, 2018 at 3:38 PM Baruch Siach wrote:
> > On Mon, May 07, 2018 at 06:13:27AM +, Tomasz Figa wrote:
> > > On Thu, May 3, 2018 at 6:09 PM Baruch Siach wrote:
> > > > On Thu, Mar 08, 2018 at 05:47:55PM +0800, Jacob Chen wrote:
> > > > > +static int rkisp1_isp_sd_s_power(struct v4l2_subdev *sd, int on)
> > > > > +{
> > > > > +	struct rkisp1_device *isp_dev = sd_to_isp_dev(sd);
> > > > > +	int ret;
> > > > > +
> > > > > +	v4l2_dbg(1, rkisp1_debug, &isp_dev->v4l2_dev, "s_power: %d\n", on);
> > > > > +
> > > > > +	if (on) {
> > > > > +		ret = pm_runtime_get_sync(isp_dev->dev);
> > > > > +		if (ret < 0)
> > > > > +			return ret;
> > > > > +
> > > > > +		rkisp1_config_clk(isp_dev);
> > > > > +	} else {
> > > > > +		ret = pm_runtime_put(isp_dev->dev);
> > > >
> > > > I commented this line out to make more than one STREAMON work.
> > > > Otherwise, the second STREAMON hangs. I guess the bug is not this
> > > > driver. Probably something in drivers/soc/rockchip/pm_domains.c.
> > > > Just noting that in case you or someone on Cc would like to
> > > > investigate it further.
> > > >
> > > > I tested v4.16-rc4 on the Tinkerboard.
> > >
> > > Looks like that version doesn't include the IOMMU PM and clock
> > > handling rework [1], which should fix a lot of runtime PM issues.
> > > FWIW, linux-next seems to already include it.
> > >
> > > [1] https://lkml.org/lkml/2018/3/23/44
> >
> > Thanks for the reference.
> >
> > It looks like the iommu driver part is in Linus' tree already. The DT
> > part is in the v4.18-armsoc/dts32 branch of Heiko's tree. Am I missing
> > anything?
>
> You're right, most of the series made it in time for 4.17. However, the
> DT part (precisely, the clocks properties added to IOMMU nodes) is
> crucial for the fixes to be effective.
>
> > Anyway, I'll take a look.
>
> Thanks for testing. :)

(Forgot to mention in my previous email...)

I finally got around to testing. Unfortunately, kernel v4.17-rc6 with a
cherry-pick of commit c78751f91c0b (ARM: dts: rockchip: add clocks in
iommu nodes) from Heiko's tree still exhibits the same problem. STREAMON
hangs on the second try. The same workaround "fixes" it.

baruch

--
     http://baruch.siach.name/blog/        ~. .~   Tk Open Systems
=}ooO--U--Ooo{=
   - bar...@tkos.co.il - tel: +972.52.368.4656, http://www.tkos.co.il -
[PATCH v4] dt-bindings: mediatek: Add binding for mt2712 IOMMU and SMI
This patch adds descriptions for the mt2712 IOMMU and SMI. In order to
balance the bandwidth, mt2712 has two M4Us and two smi-commons, with 10
smi-larbs between them. mt2712 is also MTK IOMMU gen2, which uses the
ARM Short-Descriptor translation table format.

The mt2712 M4U-SMI HW topology is as below. Both M4Us sit behind the
EMI, and all the connections are HW fixed, SW can NOT adjust them:

  M4U0 -> smi-common0 -> larb0 (disp0), larb1 (vdec), larb2 (cam),
                         larb3 (venc), larb6 (jpg)
  M4U1 -> smi-common1 -> larb4 (mdp1/disp1), larb5 (mdp2/disp2),
                         larb7 (mdp3), larb8 (vdo/nr), larb9 (tvd)

Signed-off-by: Yong Wu
Acked-by: Rob Herring
---
change notes:
v4: change the license of the new file in this patch to SPDX.
v3: http://lists.infradead.org/pipermail/linux-mediatek/2018-May/013279.html
    Add a new ECO port (DISP_RDMA2) in larb0/port7.
v2: https://lists.linuxfoundation.org/pipermail/iommu/2017-August/023848.html
v1: https://lists.linuxfoundation.org/pipermail/iommu/2017-August/023665.html
---
 .../devicetree/bindings/iommu/mediatek,iommu.txt   |  6 +-
 .../memory-controllers/mediatek,smi-common.txt     |  6 +-
 .../memory-controllers/mediatek,smi-larb.txt       |  5 +-
 include/dt-bindings/memory/mt2712-larb-port.h      | 95 ++++++++++
 4 files changed, 106 insertions(+), 6 deletions(-)
 create mode 100644 include/dt-bindings/memory/mt2712-larb-port.h

diff --git a/Documentation/devicetree/bindings/iommu/mediatek,iommu.txt b/Documentation/devicetree/bindings/iommu/mediatek,iommu.txt
index 53c20ca..df5db73 100644
--- a/Documentation/devicetree/bindings/iommu/mediatek,iommu.txt
+++ b/Documentation/devicetree/bindings/iommu/mediatek,iommu.txt
@@ -40,6 +40,7 @@ video decode local arbiter, all these ports are according to the video HW.
 Required properties:
 - compatible : must be one of the following string:
 	"mediatek,mt2701-m4u" for mt2701 which uses generation one m4u HW.
+	"mediatek,mt2712-m4u" for mt2712 which uses generation two m4u HW.
 	"mediatek,mt8173-m4u" for mt8173 which uses generation two m4u HW.
 - reg : m4u register base and size.
 - interrupts : the interrupt of m4u.
@@ -50,8 +51,9 @@ Required properties:
 	according to the local arbiter index, like larb0, larb1, larb2...
 - #iommu-cells : must be 1. This is the mtk_m4u_id according to the HW.
 	Specifies the mtk_m4u_id as defined in
-	dt-binding/memory/mt2701-larb-port.h for mt2701 and
-	dt-binding/memory/mt8173-larb-port.h for mt8173
+	dt-binding/memory/mt2701-larb-port.h for mt2701,
+	dt-binding/memory/mt2712-larb-port.h for mt2712, and
+	dt-binding/memory/mt8173-larb-port.h for mt8173.

 Example:
 	iommu: iommu@10205000 {
diff --git a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.txt b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.txt
index aa614b2..615abdd 100644
--- a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.txt
+++ b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.txt
@@ -2,8 +2,9 @@ SMI (Smart Multimedia Interface) Common

 The hardware block diagram please check bindings/iommu/mediatek,iommu.txt

-Mediatek SMI have two generations of HW architecture, mt8173 uses the second
-generation of SMI HW while mt2701 uses the first generation HW of SMI.
+Mediatek SMI have two generations of HW architecture, mt2712 and mt8173 use
+the second generation of SMI HW while mt2701 uses the first generation HW of
+SMI.

 There's slight differences between the two SMI, for generation 2, the
 register which control the iommu port is at each larb's register base. But
@@ -15,6 +16,7 @@ not needed for SMI generation 2.
 Required properties:
 - compatible : must be one of :
 	"mediatek,mt2701-smi-common"
+	"mediatek,mt2712-smi-common"
 	"mediatek,mt8173-smi-common"
 - reg : the register and size of the SMI block.
 - power-domains : a phandle to the power domain of this local arbiter.
diff --git a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.txt b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.txt
index ddf46b8..083155c 100644
--- a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.txt
+++ b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.txt
@@ -4,8 +4,9 @@ The hardware block diagram please check bindings/iommu/mediatek,iommu.txt

 Required properties:
 -
Re: [PATCH v2 13/40] vfio: Add support for Shared Virtual Addressing
On 2018/5/24 19:44, Jean-Philippe Brucker wrote:
> Hi,
>
> On 23/05/18 10:38, Xu Zaibo wrote:
>>> +static int vfio_iommu_bind_group(struct vfio_iommu *iommu,
>>> +				 struct vfio_group *group,
>>> +				 struct vfio_mm *vfio_mm)
>>> +{
>>> +	int ret;
>>> +	bool enabled_sva = false;
>>> +	struct vfio_iommu_sva_bind_data data = {
>>> +		.vfio_mm	= vfio_mm,
>>> +		.iommu		= iommu,
>>> +		.count		= 0,
>>> +	};
>>> +
>>> +	if (!group->sva_enabled) {
>>> +		ret = iommu_group_for_each_dev(group->iommu_group, NULL,
>>> +					       vfio_iommu_sva_init);
>> Do we need to do *sva_init here or do anything to avoid repeated
>> initiation? While another process already did initiation at this
>> device, I think that current process will get an EEXIST.
> Right, sva_init() must be called once for any device that intends to use
> bind(). For the second process though, group->sva_enabled will be true
> so we won't call sva_init() again, only bind().

Well, while I create mediated devices based on one parent device to
support multiple processes (a new process will create a new 'vfio_group'
for the corresponding mediated device, and 'sva_enabled' cannot work any
more), in fact, *sva_init and *sva_shutdown are basically working on the
parent device, so, as a result, I only need sva initiation and shutdown
on the parent device once. So I change the two as following:

@@ -551,8 +565,18 @@ int iommu_sva_device_init(struct device *dev, unsigned long features,
 	if (features & ~IOMMU_SVA_FEAT_IOPF)
 		return -EINVAL;

+	/* If already exists, do nothing */
+	mutex_lock(&dev->iommu_param->lock);
+	if (dev->iommu_param->sva_param) {
+		mutex_unlock(&dev->iommu_param->lock);
+		return 0;
+	}
+	mutex_unlock(&dev->iommu_param->lock);

 	if (features & IOMMU_SVA_FEAT_IOPF) {
 		ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf,

@@ -621,6 +646,14 @@ int iommu_sva_device_shutdown(struct device *dev)
 	if (!domain)
 		return -ENODEV;

+	/* If any other process is working on the device, shut down does nothing. */
+	mutex_lock(&dev->iommu_param->lock);
+	if (!list_empty(&dev->iommu_param->sva_param->mm_list)) {
+		mutex_unlock(&dev->iommu_param->lock);
+		return 0;
+	}
+	mutex_unlock(&dev->iommu_param->lock);
+
 	__iommu_sva_unbind_dev_all(dev);

 	mutex_lock(&dev->iommu_param->lock);

I add the above two checkings in both *sva_init and *sva_shutdown, it is
working now, but i don't know if it will cause any new problems. What's
more, i doubt if it is reasonable to check this to avoid repeating
operation in VFIO, maybe it is better to check in IOMMU. :)

Thanks
Zaibo
Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD
On Wed 23-05-18 22:19:19, Matthew Wilcox wrote:
> On Tue, May 22, 2018 at 08:37:28PM +0200, Michal Hocko wrote:
> > So why is this any better than the current code. Sure I am not a great
> > fan of GFP_ZONE_TABLE because of how it is incomprehensible but this
> > doesn't look too much better, yet we are losing a check for incompatible
> > gfp flags. The diffstat looks really sound but then you just look and
> > see that the large part is the comment that at least explained the gfp
> > zone modifiers somehow and the debugging code. So what is the selling
> > point?
>
> I have a plan, but it's not exactly fully-formed yet.
>
> One of the big problems we have today is that we have a lot of users
> who have constraints on the physical memory they want to allocate,
> but we have very limited abilities to provide them with what they're
> asking for. The various different ZONEs have different meanings on
> different architectures and are generally a mess.

Agreed.

> If we had eight ZONEs, we could offer:

No, please no more zones. What we have is quite a maint. burden on its
own. Ideally we should only have lowmem, highmem and special/device
zones for directly kernel accessible memory, the one that the kernel
cannot or must not use and completely special memory managed out of the
page allocator. All the remaining constraints should better be
implemented on top.

> ZONE_16M	// 24 bit
> ZONE_256M	// 28 bit
> ZONE_LOWMEM	// CONFIG_32BIT only
> ZONE_4G	// 32 bit
> ZONE_64G	// 36 bit
> ZONE_1T	// 40 bit
> ZONE_ALL	// everything larger
> ZONE_MOVABLE	// movable allocations; no physical address guarantees
>
> #ifdef CONFIG_64BIT
> #define ZONE_NORMAL	ZONE_ALL
> #else
> #define ZONE_NORMAL	ZONE_LOWMEM
> #endif
>
> This would cover most driver DMA mask allocations; we could tweak the
> offered zones based on analysis of what people need.

But those already do have a proper API, IIUC. So do we really need to
make our GFP_*/Zone API more complicated than it already is?
> #define GFP_HIGHUSER		(GFP_USER | ZONE_ALL)
> #define GFP_HIGHUSER_MOVABLE	(GFP_USER | ZONE_MOVABLE)
>
> One other thing I want to see is that fallback from zones happens from
> highest to lowest normally (ie if you fail to allocate in 1T, then you
> try to allocate from 64G), but movable allocations happen from lowest
> to highest. So ZONE_16M ends up full of page cache pages which are
> readily evictable for the rare occasions when we need to allocate memory
> below 16MB.
>
> I'm sure there are lots of good reasons why this won't work, which is
> why I've been hesitant to propose it before now.

I am worried you are playing with a can of worms...
--
Michal Hocko
SUSE Labs
Re: [External] Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD
On Wed 23-05-18 16:07:16, Huaisheng HS1 Ye wrote:
> From: Michal Hocko [mailto:mho...@kernel.org]
> Sent: Wednesday, May 23, 2018 2:37 AM
> >
> > On Mon 21-05-18 23:20:21, Huaisheng Ye wrote:
> > > From: Huaisheng Ye
> > >
> > > Replace GFP_ZONE_TABLE and GFP_ZONE_BAD with encoded zone number.
> > >
> > > Delete ___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 from GFP bitmasks,
> > > the bottom three bits of GFP mask is reserved for storing encoded
> > > zone number.
> > >
> > > The encoding method is XOR. Get zone number from enum zone_type,
> > > then encode the number with ZONE_NORMAL by XOR operation.
> > > The goal is to make sure ZONE_NORMAL can be encoded to zero. So,
> > > the compatibility can be guaranteed, such as GFP_KERNEL and GFP_ATOMIC
> > > can be used as before.
> > >
> > > Reserve __GFP_MOVABLE in bit 3, so that it can continue to be used as
> > > a flag. Same as before, __GFP_MOVABLE represents movable migrate type
> > > for ZONE_DMA, ZONE_DMA32, and ZONE_NORMAL. But when it is enabled with
> > > __GFP_HIGHMEM, ZONE_MOVABLE shall be returned instead of ZONE_HIGHMEM.
> > > __GFP_ZONE_MOVABLE is created to realize it.
> > >
> > > With this patch, just enabling __GFP_MOVABLE and __GFP_HIGHMEM is not
> > > enough to get ZONE_MOVABLE from gfp_zone. All callers should use
> > > GFP_HIGHUSER_MOVABLE or __GFP_ZONE_MOVABLE directly to achieve that.
> > >
> > > Decode zone number directly from bottom three bits of flags in gfp_zone.
> > > The theory of encoding and decoding is,
> > > A ^ B ^ B = A
> >
> > So why is this any better than the current code. Sure I am not a great
> > fan of GFP_ZONE_TABLE because of how it is incomprehensible but this
> > doesn't look too much better, yet we are losing a check for incompatible
> > gfp flags. The diffstat looks really sound but then you just look and
> > see that the large part is the comment that at least explained the gfp
> > zone modifiers somehow and the debugging code. So what is the selling
> > point?
> Dear Michal,
>
> Let me try to reply your questions.
> Exactly, GFP_ZONE_TABLE is too complicated. I think there are two
> advantages from the series of patches.
>
> 1. XOR operation is simple and efficient, GFP_ZONE_TABLE/BAD need to do
> twice shift operations, the first is for getting a zone_type and the
> second is for checking the to be returned type is a correct or not. But
> with these patches the XOR operation just needs to be used once. Because
> the bottom 3 bits of GFP bitmask have been used to represent the encoded
> zone number, we can say there is no bad zone number if all callers could
> use it without buggy way. Of course, the returned zone type in gfp_zone
> needs to be no more than ZONE_MOVABLE.

But you are losing the ability to check for wrong usage. And it seems
that the sad reality is that the existing code do screw up.

> 2. GFP_ZONE_TABLE has limit with the amount of zone types. Current
> GFP_ZONE_TABLE is 32 bits, in general, there are 4 zone types for most
> of X86_64 platform, they are ZONE_DMA, ZONE_DMA32, ZONE_NORMAL and
> ZONE_MOVABLE. If we want to expand the amount of zone types to larger
> than 4, the zone shift should be 3.

But we do not want to expand the number of zones IMHO. The existing zoo
is quite a maint. pain.

That being said. I am not saying that I am in love with GFP_ZONE_TABLE.
It always makes my head explode when I look there but it seems to work
with the current code and it is optimized for it. If you want to change
this then you should make sure you describe reasons _why_ this is an
improvement. And I would argue that "we can have more zones" is a
relevant one.
--
Michal Hocko
SUSE Labs
Re: [PATCH v2 13/40] vfio: Add support for Shared Virtual Addressing
Hi,

On 23/05/18 10:38, Xu Zaibo wrote:
>> +static int vfio_iommu_bind_group(struct vfio_iommu *iommu,
>> +				 struct vfio_group *group,
>> +				 struct vfio_mm *vfio_mm)
>> +{
>> +	int ret;
>> +	bool enabled_sva = false;
>> +	struct vfio_iommu_sva_bind_data data = {
>> +		.vfio_mm	= vfio_mm,
>> +		.iommu		= iommu,
>> +		.count		= 0,
>> +	};
>> +
>> +	if (!group->sva_enabled) {
>> +		ret = iommu_group_for_each_dev(group->iommu_group, NULL,
>> +					       vfio_iommu_sva_init);
> Do we need to do *sva_init here or do anything to avoid repeated
> initiation?
> while another process already did initiation at this device, I think
> that current process will get an EEXIST.

Right, sva_init() must be called once for any device that intends to use
bind(). For the second process though, group->sva_enabled will be true
so we won't call sva_init() again, only bind().

Thanks,
Jean
Re: [PATCH v2 07/40] iommu: Add a page fault handler
On 23/05/18 00:35, Jacob Pan wrote:
>>>> +		/* Insert *before* the last fault */
>>>> +		list_move(>head, >faults);
>>>> +	}
>>>> +
>>> If you already sorted the group list with last fault at the end,
>>> why do you need a separate entry to track the last fault?
>>
>> Not sure I understand your question, sorry. Do you mean why the
>> iopf_group.last_fault? Just to avoid one more kzalloc.
>>
> kind of :) what i thought was that why can't the last_fault naturally
> be the last entry in your group fault list? then there is no need for
> special treatment in terms of allocation of the last fault. just my
> preference.

But we need a kzalloc for the last fault anyway, so I thought I'd just
piggy-back on the group allocation. And if I don't add the last fault at
the end of group->faults, then it's iopf_handle_group that requires
special treatment. I'm still not sure I understood your question though,
could you send me a patch that does it?

>>>> +
>>>> +	queue->flush(queue->flush_arg, dev);
>>>> +
>>>> +	/*
>>>> +	 * No need to clear the partial list. All PRGs containing the PASID that
>>>> +	 * needs to be decommissioned are whole (the device driver made sure of
>>>> +	 * it before this function was called). They have been submitted to the
>>>> +	 * queue by the above flush().
>>>> +	 */
>>> So you are saying device driver need to make sure LPIG PRQ is
>>> submitted in the flush call above such that partial list is
>>> cleared?
>>
>> Not exactly, it's the IOMMU driver that makes sure all LPIG in its
>> queues are submitted by the above flush call. In more details the
>> flow is:
>>
>> * Either device driver calls unbind()/sva_device_shutdown(), or the
>>   process exits.
>> * If the device driver called, then it already told the device to stop
>>   using the PASID. Otherwise we use the mm_exit() callback to tell the
>>   device driver to stop using the PASID.
>> * In either case, when receiving a stop request from the driver, the
>>   device sends the LPIGs to the IOMMU queue.
>> * Then, the flush call above ensures that the IOMMU reports the LPIG
>>   with iommu_report_device_fault.
>> * While submitting all LPIGs for this PASID to the work queue,
>>   iopf_queue_fault also picked up all partial faults, so the partial
>>   list is clean.
>>
>> Maybe I should improve this comment?
>>
> thanks for explaining. LPIG submission is done by device asynchronously
> w.r.t. driver stopping/decommission PASID.

Hmm, it should really be synchronous, otherwise there is no way to know
when the PASID can be decommissioned. We need a guarantee such as the
one in 6.20.1 of the PCIe spec, "Managing PASID TLP Prefix Usage":

"When the stop request mechanism indicates completion, the Function has:
* Completed all Non-Posted Requests associated with this PASID.
* Flushed to the host all Posted Requests addressing host memory in all
  TCs that were used by the PASID."

That's in combination with "The function shall [...] finish transmitting
any multi-page Page Request Messages for this PASID (i.e. send the Page
Request Message with the L bit Set)." from the ATS spec. (If I remember
correctly a PRI Page Request is a Posted Request.)

Only after this stop request completes can the driver call unbind(), or
return from exit_mm(). Then we know that if there was a LPIG for that
PASID, it is in the IOMMU's PRI queue (or already completed) once we
call flush().

> so if we were to use this
> flow on vt-d, which does not stall page request queue, then we should
> use the iommu model specific flush() callback to ensure LPIG is
> received? There could be queue full condition and retry. I am just
> trying to understand how and where we can make sure LPIG is
> received and all groups are whole.

For SMMU in patch 30, the flush() callback waits until the PRI queue is
empty or, when the PRI thread is running in parallel, until the thread
has done a full circle (handled as many faults as the queue size).

It's really unpleasant to implement because the flush() callback takes a
lock to inspect the hardware state, but I don't think we have a choice.

Thanks,
Jean
Re: [PATCH v2 03/40] iommu/sva: Manage process address spaces
On 22/05/18 17:43, Jacob Pan wrote:
> On Thu, 17 May 2018 11:02:42 +0100
> Jean-Philippe Brucker wrote:
>
>> On 17/05/18 00:31, Jacob Pan wrote:
>>> On Fri, 11 May 2018 20:06:04 +0100
>>> I am a little confused about domain vs. pasid relationship. If
>>> each domain represents an address space, should there be a domain for
>>> each pasid?
>>
>> I don't think there is a formal definition, but from previous
>> discussion the consensus seems to be: domains are a collection of
>> devices that have the same virtual address spaces (one or many).
>>
>> Keeping that definition makes things easier, in my opinion. Some time
>> ago, I did try to represent PASIDs using "subdomains" (introducing a
>> hierarchy of struct iommu_domain), but it required invasive changes in
>> the IOMMU subsystem and probably all over the tree.
>>
>> You do need some kind of "root domain" for each device, so that
>> "iommu_get_domain_for_dev()" still makes sense. That root domain
>> doesn't have a single address space but a collection of subdomains.
>> If you need this anyway, representing a PASID with an iommu_domain
>> doesn't seem preferable than using a different structure (io_mm),
>> because they don't have anything in common.
>>
> My main concern is the PASID table storage. If PASID table storage
> is tied to domain, it is ok to scale up, i.e. multiple devices in a
> domain share a single PASID table. But to scale down, e.g. further
> partition device with VFIO mdev for assignment, each mdev may get its
> own domain via vfio. But there is no IOMMU storage for PASID table at
> mdev device level. Perhaps we do need root domain or some parent-child
> relationship to locate PASID table.

Interesting, I hadn't thought about this use-case before. At first I
thought you were talking about mdev devices assigned to VMs, but I think
you're referring to mdevs assigned to userspace drivers instead? Out of
curiosity, is it only theoretical or does someone actually need this?
I don't think mdev for VM assignment are compatible with PASID, at least
not when the IOMMU is involved. I usually ignore mdev in my reasoning
because, as far as I know, currently they only affect devices that have
their own MMU, and IOMMU domains don't come into play. However, if a
device was backed by the IOMMU, and the device driver wanted to
partition it into mdevs, then users would be tempted to assign mdev1 to
VM1 and mdev2 to VM2. It doesn't work with PASID, because the PASID
spaces of VM1 and VM2 will conflict. If both VM1 and VM2 allocate PASID
#1, then the host has to somehow arbitrate device accesses, for example
scheduling first mdev1 then mdev2. That's possible if the device driver
is in charge of the MMU, but not really compatible with the IOMMU.

So in the IOMMU subsystem, for assigning devices to VMs the only model
that makes sense is SR-IOV, where each VF/mdev has its own RID and its
own PASID table. In that case you'd get one IOMMU domain per VF.

But considering userspace drivers in the host alone, it might make sense
to partition a device into mdevs and assign them to multiple processes.
Interestingly this scheme still doesn't work with the classic MAP/UNMAP
ioctl: since there is a single device context entry for all mdevs, the
mediating driver would have to catch all MAP/UNMAP ioctls and reject
those with IOVAs that overlap those of another mdev. It doesn't seem
viable. But when using PASID then each mdev has its own address space,
and since PASIDs are allocated by the kernel there is no such conflict.

Anyway, I think this use-case can work with the current structures, if
the mediating driver does the bind() instead of VFIO. That's necessary
because you can't let userspace program the PASID into the device, or
they would be able to access address spaces owned by other mdevs. Then
the mediating driver does the bind(), and keeps internal structures to
associate the process to the given mdev.
Thanks,
Jean
Re: [PATCH v7 1/2] iommu - Enable debugfs exposure of IOMMU driver internals
On Mon, May 14, 2018 at 12:20:20PM -0500, Gary R Hook wrote:
> Provide base enablement for using debugfs to expose internal data of an
> IOMMU driver. When called, create the /sys/kernel/debug/iommu directory.
>
> Emit a strong warning at boot time to indicate that this feature is
> enabled.
>
> This function is called from iommu_init, and creates the initial DebugFS
> directory. Drivers may then call iommu_debugfs_new_driver_dir() to
> instantiate a device-specific directory to expose internal data.
> It will return a pointer to the new dentry structure created in
> /sys/kernel/debug/iommu, or NULL in the event of a failure.
>
> Since the IOMMU driver can not be removed from the running system, there
> is no need for an "off" function.
>
> Signed-off-by: Gary R Hook
> ---
>  drivers/iommu/Kconfig         | 10 ++
>  drivers/iommu/Makefile        |  1 +
>  drivers/iommu/iommu-debugfs.c | 72 +
>  drivers/iommu/iommu.c         |  2 +
>  include/linux/iommu.h         | 11 ++
>  5 files changed, 96 insertions(+)
>  create mode 100644 drivers/iommu/iommu-debugfs.c
>
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index f3a21343e636..2eab6a849a0a 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -60,6 +60,16 @@ config IOMMU_IO_PGTABLE_ARMV7S_SELFTEST
>
>  endmenu
>
> +config IOMMU_DEBUGFS
> +	bool "Export IOMMU internals in DebugFS"
> +	depends on DEBUG_FS
> +	help
> +	  Allows exposure of IOMMU device internals. This option enables
> +	  the use of debugfs by IOMMU drivers as required. Devices can,
> +	  at initialization time, cause the IOMMU code to create a top-level
> +	  debug/iommu directory, and then populate a subdirectory with
> +	  entries as required.

You need a big "DO NOT ENABLE THIS OPTION UNLESS YOU REALLY REALLY KNOW
WHAT YOU ARE DOING!!!" line here. To match up with your crazy kernel log
warning. Otherwise distros will turn this on, I guarantee it.
> +
>  config IOMMU_IOVA
>  	tristate
>
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 1fb695854809..74cfbc392862 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -2,6 +2,7 @@
>  obj-$(CONFIG_IOMMU_API) += iommu.o
>  obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>  obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
> +obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
>  obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
> diff --git a/drivers/iommu/iommu-debugfs.c b/drivers/iommu/iommu-debugfs.c
> new file mode 100644
> index ..bb1bf2d0ac51
> --- /dev/null
> +++ b/drivers/iommu/iommu-debugfs.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * IOMMU debugfs core infrastructure
> + *
> + * Copyright (C) 2018 Advanced Micro Devices, Inc.
> + *
> + * Author: Gary R Hook
> + */
> +
> +#include >
> +#include >
> +#include >
> +
> +static struct dentry *iommu_debugfs_dir;
> +
> +/**
> + * iommu_debugfs_setup - create the top-level iommu directory in debugfs
> + *
> + * Provide base enablement for using debugfs to expose internal data of an
> + * IOMMU driver. When called, this function creates the
> + * /sys/kernel/debug/iommu directory.
> + *
> + * Emit a strong warning at boot time to indicate that this feature is
> + * enabled.
> + *
> + * This function is called from iommu_init; drivers may then call
> + * iommu_debugfs_new_driver_dir() to instantiate a vendor-specific
> + * directory to be used to expose internal data.
> + */
> +void iommu_debugfs_setup(void)
> +{
> +	if (!iommu_debugfs_dir) {
> +		iommu_debugfs_dir = debugfs_create_dir("iommu", NULL);
> +		if (iommu_debugfs_dir) {

No need to care about the return value, just keep going. Nothing you
should, or should not, do depending on the return value of a debugfs
call.

> +			pr_warn("\n");
> +			pr_warn("*************************************************************\n");
> +			pr_warn("** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE        **\n");
> +			pr_warn("**                                                         **\n");
> +			pr_warn("** IOMMU DebugFS SUPPORT HAS BEEN ENABLED IN THIS KERNEL   **\n");
> +			pr_warn("**                                                         **\n");
> +			pr_warn("** This means that this kernel is built to expose internal **\n");
> +			pr_warn("** IOMMU data structures, which may compromise security on **\n");
> +			pr_warn("** your system.                                            **\n");
> +			pr_warn("**                                                         **\n");
> +			pr_warn("**
Re: [PATCH v3] dma-debug: Check scatterlist segments
Thanks Robin, applied to the dma-mapping tree.