RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
On Wed, Sep 29 2021, "Tian, Kevin" wrote:

>> From: David Gibson
>> Sent: Wednesday, September 29, 2021 10:44 AM
>>
>> > One alternative option is to arrange device nodes in sub-directories based
>> > on the device type. But doing so also adds one trouble to userspace. The
>> > current vfio uAPI is designed to have the user query device type via
>> > VFIO_DEVICE_GET_INFO after opening the device. With this option the user
>> > instead needs to figure out the device type before opening the device, to
>> > identify the sub-directory.
>>
>> Wouldn't this be up to the operator / configuration, rather than the
>> actual software though?  I would assume that typically the VFIO
>> program would be pointed at a specific vfio device node file to use,
>> e.g.
>>     my-vfio-prog -d /dev/vfio/pci/:0a:03.1
>>
>> Or more generally, if you're expecting userspace to know a name in a
>> unique pattern, they can equally well know a "type/name" pair.
>>
> You are correct. Currently:
>
>     -device vfio-pci,host=:BB:DD.F
>     -device vfio-pci,sysfsdev=/sys/bus/pci/devices/:BB:DD.F
>     -device vfio-platform,sysfsdev=/sys/bus/platform/devices/PNP0103:00
>
> above is definitely type/name information to find the related node.
>
> Actually even for Jason's proposal we still need such information to
> identify the sysfs path.
>
> Then I feel type-based sub-directory does work. Adding another link
> to sysfs sounds unnecessary now. But I'm not sure whether we still
> want to create the /dev/vfio/devices/vfio0 thing and the related udev
> rule thing that you pointed out in another mail.

Still reading through this whole thread, but type-based subdirectories
also make the most sense to me. I don't really see userspace wanting to
grab just any device and then figure out whether it is the device it was
looking for, but rather immediately go to a specific device or at least
a device of a specific type.
Sequentially-numbered devices tend to become really unwieldy in my
experience if you are working on a system with loads of devices.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
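The "type/name pair" point can be made concrete: a program pointed at a specific device only needs to compose (or be handed) the per-type node path. A minimal userspace sketch, assuming the *proposed* /dev/vfio/&lt;type&gt;/&lt;name&gt; layout from this thread (not an existing kernel ABI; the function name is illustrative):

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Compose a device node path from a (type, name) pair, e.g.
 * ("pci", "0000:00:1f.0") -> "/dev/vfio/pci/0000:00:1f.0".
 * The /dev/vfio/<type>/<name> layout is the scheme discussed in this
 * thread, not current uAPI. */
static int vfio_node_path(char *buf, size_t len,
                          const char *type, const char *name)
{
	int n = snprintf(buf, len, "/dev/vfio/%s/%s", type, name);

	/* fail if the path was truncated */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A management tool would pass the resulting path straight to the program (as in the `my-vfio-prog -d ...` example above), so the program itself never has to guess the device type.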
Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
On Wed, Sep 29, 2021 at 05:38:56AM +, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Wednesday, September 29, 2021 12:56 PM
> >
> > > Unlike vfio, iommufd adopts a device-centric design with all group
> > > logistics hidden behind the fd. Binding a device to iommufd serves
> > > as the contract to get security context established (and vice versa
> > > for unbinding). One additional requirement in iommufd is to manage the
> > > switch between multiple security contexts due to decoupled bind/attach:
> > >
> > > 1)  Open a device in "/dev/vfio/devices" with user access blocked;
> >
> > Probably worth clarifying that (1) must happen for *all* devices in
> > the group before (2) happens for any device in the group.
>
> No. User access is naturally blocked for other devices as long as they
> are not opened yet.

Uh... my point is that everything in the group has to be removed from
regular kernel drivers before we reach step (2). Is the plan that you
must do that before you can even open them? That's a reasonable
choice, but then I think you should show that step in this description
as well.

> > > 2)  Bind the device to an iommufd with an initial security context
> > >     (an empty iommu domain which blocks dma) established for its
> > >     group, with user access unblocked;
> > >
> > > 3)  Attach the device to a user-specified ioasid (shared by all devices
> > >     attached to this ioasid). Before attaching, the device should be first
> > >     detached from the initial context;
> >
> > So, this step can implicitly but observably change the behaviour for
> > other devices in the group as well. I don't love that kind of
> > difficult to predict side effect, which is why I'm *still* not totally
> > convinced by the device-centric model.
>
> which side-effect is predicted here? The user anyway needs to be
> aware of such group restriction regardless whether it uses group
> or nongroup interface.

Yes, exactly. And with a group interface it's obvious it has to
understand it. With the non-group interface, you can get to this stage
in ignorance of groups. It will even work as long as you are lucky
enough only to try with singleton-group devices. Then you try it with
two devices in the one group and doing (3) on device A will implicitly
change the DMA environment of device B. (Or at least, it will if they
share a group because they don't have distinguishable RIDs. That's not
the only multi-device group case, but it's one of them.)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
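The group restriction being debated can be expressed as a single invariant: every device in a group must end up attached to the same ioasid, because devices sharing a group are not isolated from each other. A toy model of step (3) under that invariant — NOT the real iommufd implementation, all names are illustrative:

```c
#include <assert.h>

#define IOASID_NONE (-1)

struct dev_state {
	int group;	/* group the device belongs to */
	int ioasid;	/* IOASID_NONE until step (3) succeeds */
};

/* Step (3): attach dev to an ioasid. Fails if another device in the
 * same group is already attached to a *different* ioasid -- this is
 * the implicit cross-device constraint David points out. */
static int ioasid_attach(struct dev_state *devs, int ndev,
			 int dev, int ioasid)
{
	int i;

	for (i = 0; i < ndev; i++) {
		if (i != dev && devs[i].group == devs[dev].group &&
		    devs[i].ioasid != IOASID_NONE &&
		    devs[i].ioasid != ioasid)
			return -1; /* group members must share the ioasid */
	}
	devs[dev].ioasid = ioasid;
	return 0;
}
```

With singleton groups every attach succeeds; with a shared group, the first attach constrains what every later attach in that group may do — which is exactly why the thread argues the restriction must be documented in the uAPI even for a non-group interface.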
[PATCH 1/1] iommu/vt-d: Delete dev_has_feat callback
The commit 262948f8ba573 ("iommu: Delete iommu_dev_has_feature()") has
deleted the iommu_dev_has_feature() interface. Remove the dev_has_feat
callback to avoid dead code.

Signed-off-by: Lu Baolu
---
 drivers/iommu/intel/iommu.c | 59 -
 1 file changed, 5 insertions(+), 54 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 01162b0b9712..dc733de9eea5 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5516,62 +5516,14 @@ static int intel_iommu_disable_sva(struct device *dev)
 	return ret;
 }
 
-/*
- * A PCI express designated vendor specific extended capability is defined
- * in the section 3.7 of Intel scalable I/O virtualization technical spec
- * for system software and tools to detect endpoint devices supporting the
- * Intel scalable IO virtualization without host driver dependency.
- *
- * Returns the address of the matching extended capability structure within
- * the device's PCI configuration space or 0 if the device does not support
- * it.
- */
-static int siov_find_pci_dvsec(struct pci_dev *pdev)
-{
-	int pos;
-	u16 vendor, id;
-
-	pos = pci_find_next_ext_capability(pdev, 0, 0x23);
-	while (pos) {
-		pci_read_config_word(pdev, pos + 4, &vendor);
-		pci_read_config_word(pdev, pos + 8, &id);
-		if (vendor == PCI_VENDOR_ID_INTEL && id == 5)
-			return pos;
-
-		pos = pci_find_next_ext_capability(pdev, pos, 0x23);
-	}
-
-	return 0;
-}
-
-static bool
-intel_iommu_dev_has_feat(struct device *dev, enum iommu_dev_features feat)
+static int intel_iommu_enable_iopf(struct device *dev)
 {
 	struct device_domain_info *info = get_domain_info(dev);
 
-	if (feat == IOMMU_DEV_FEAT_AUX) {
-		int ret;
-
-		if (!dev_is_pci(dev) || dmar_disabled ||
-		    !scalable_mode_support() || !pasid_mode_support())
-			return false;
-
-		ret = pci_pasid_features(to_pci_dev(dev));
-		if (ret < 0)
-			return false;
-
-		return !!siov_find_pci_dvsec(to_pci_dev(dev));
-	}
-
-	if (feat == IOMMU_DEV_FEAT_IOPF)
-		return info && info->pri_supported;
-
-	if (feat == IOMMU_DEV_FEAT_SVA)
-		return info && (info->iommu->flags & VTD_FLAG_SVM_CAPABLE) &&
-			info->pasid_supported && info->pri_supported &&
-			info->ats_supported;
+	if (info && info->pri_supported)
+		return 0;
 
-	return false;
+	return -ENODEV;
 }
 
 static int
@@ -5582,7 +5534,7 @@ intel_iommu_dev_enable_feat(struct device *dev, enum iommu_dev_features feat)
 		return intel_iommu_enable_auxd(dev);
 
 	case IOMMU_DEV_FEAT_IOPF:
-		return intel_iommu_dev_has_feat(dev, feat) ? 0 : -ENODEV;
+		return intel_iommu_enable_iopf(dev);
 
 	case IOMMU_DEV_FEAT_SVA:
 		return intel_iommu_enable_sva(dev);
@@ -5708,7 +5660,6 @@ const struct iommu_ops intel_iommu_ops = {
 	.get_resv_regions	= intel_iommu_get_resv_regions,
 	.put_resv_regions	= generic_iommu_put_resv_regions,
 	.device_group		= intel_iommu_device_group,
-	.dev_has_feat		= intel_iommu_dev_has_feat,
	.dev_feat_enabled	= intel_iommu_dev_feat_enabled,
 	.dev_enable_feat	= intel_iommu_dev_enable_feat,
 	.dev_disable_feat	= intel_iommu_dev_disable_feat,
-- 
2.25.1
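The deleted siov_find_pci_dvsec() walks the PCI extended capability list (which starts at config offset 0x100) looking for a DVSEC capability (ID 0x23) whose DVSEC vendor is Intel and DVSEC ID is 5. The same walk can be sketched in userspace over a raw config-space image instead of the kernel's pci_find_next_ext_capability()/pci_read_config_word() helpers — a simplified illustration, not the kernel code:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define PCI_EXT_CAP_ID_DVSEC	0x23
#define PCI_VENDOR_ID_INTEL	0x8086
#define SIOV_DVSEC_ID		5

static uint32_t cfg_read32(const uint8_t *cfg, size_t off)
{
	return (uint32_t)cfg[off] | (uint32_t)cfg[off + 1] << 8 |
	       (uint32_t)cfg[off + 2] << 16 | (uint32_t)cfg[off + 3] << 24;
}

static void cfg_write32(uint8_t *cfg, size_t off, uint32_t v)
{
	cfg[off] = v & 0xff;
	cfg[off + 1] = (v >> 8) & 0xff;
	cfg[off + 2] = (v >> 16) & 0xff;
	cfg[off + 3] = (v >> 24) & 0xff;
}

/* Extended capability header: bits 15:0 cap ID, 19:16 version,
 * 31:20 offset of the next capability (0 terminates the list).
 * For a DVSEC cap, the vendor sits in the low 16 bits at +4 and the
 * DVSEC ID in the low 16 bits at +8 -- matching the pos+4 / pos+8
 * word reads in the deleted function. */
static size_t find_siov_dvsec(const uint8_t *cfg)
{
	size_t pos = 0x100;

	while (pos) {
		uint32_t hdr = cfg_read32(cfg, pos);

		if ((hdr & 0xffff) == PCI_EXT_CAP_ID_DVSEC &&
		    (cfg_read32(cfg, pos + 4) & 0xffff) == PCI_VENDOR_ID_INTEL &&
		    (cfg_read32(cfg, pos + 8) & 0xffff) == SIOV_DVSEC_ID)
			return pos;

		pos = hdr >> 20; /* next capability offset */
	}
	return 0;
}
```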
RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
> From: David Gibson
> Sent: Wednesday, September 29, 2021 2:35 PM
>
> On Wed, Sep 29, 2021 at 05:38:56AM +, Tian, Kevin wrote:
> > > From: David Gibson
> > > Sent: Wednesday, September 29, 2021 12:56 PM
> > >
> > > > Unlike vfio, iommufd adopts a device-centric design with all group
> > > > logistics hidden behind the fd. Binding a device to iommufd serves
> > > > as the contract to get security context established (and vice versa
> > > > for unbinding). One additional requirement in iommufd is to manage the
> > > > switch between multiple security contexts due to decoupled bind/attach:
> > > >
> > > > 1)  Open a device in "/dev/vfio/devices" with user access blocked;
> > >
> > > Probably worth clarifying that (1) must happen for *all* devices in
> > > the group before (2) happens for any device in the group.
> >
> > No. User access is naturally blocked for other devices as long as they
> > are not opened yet.
>
> Uh... my point is that everything in the group has to be removed from
> regular kernel drivers before we reach step (2). Is the plan that you
> must do that before you can even open them? That's a reasonable
> choice, but then I think you should show that step in this description
> as well.

Agree. I think the below proposal can meet the above requirement and
ensure it's not broken in the whole process when the group is operated
by the userspace:

https://lore.kernel.org/kvm/20210928140712.gl964...@nvidia.com/

and definitely an updated description will be provided when sending out
the new proposal.

> > > > 2)  Bind the device to an iommufd with an initial security context
> > > >     (an empty iommu domain which blocks dma) established for its
> > > >     group, with user access unblocked;
> > > >
> > > > 3)  Attach the device to a user-specified ioasid (shared by all devices
> > > >     attached to this ioasid). Before attaching, the device should be
> > > >     first detached from the initial context;
> > >
> > > So, this step can implicitly but observably change the behaviour for
> > > other devices in the group as well. I don't love that kind of
> > > difficult to predict side effect, which is why I'm *still* not totally
> > > convinced by the device-centric model.
> >
> > which side-effect is predicted here? The user anyway needs to be
> > aware of such group restriction regardless whether it uses group
> > or nongroup interface.
>
> Yes, exactly. And with a group interface it's obvious it has to
> understand it. With the non-group interface, you can get to this
> stage in ignorance of groups. It will even work as long as you are
> lucky enough only to try with singleton-group devices. Then you try
> it with two devices in the one group and doing (3) on device A will
> implicitly change the DMA environment of device B.

for non-group we can also document it obviously in uAPI that the user
must understand the group restriction, and violating it will get a
failure when attaching devices in the same group to different IOAS's.

Thanks
Kevin
Re: [PATCH 1/8] iommu/vt-d: Clean up unused PASID updating functions
Hi Fenghua,

On 2021/9/21 3:23, Fenghua Yu wrote:
> update_pasid() and its call chain are currently unused in the tree
> because Thomas disabled the ENQCMD feature. The feature will be
> re-enabled shortly using a different approach and update_pasid() and
> its call chain will not be used in the new approach.
>
> Remove the useless functions.
>
> Signed-off-by: Fenghua Yu
> Reviewed-by: Tony Luck

Thanks for this cleanup. I have queued it for v5.16.

Best regards,
baolu

> ---
>  arch/x86/include/asm/fpu/api.h |  2 --
>  drivers/iommu/intel/svm.c      | 24 +---
>  2 files changed, 1 insertion(+), 25 deletions(-)
>
> diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
> index 23bef08a8388..ca4d0dee1ecd 100644
> --- a/arch/x86/include/asm/fpu/api.h
> +++ b/arch/x86/include/asm/fpu/api.h
> @@ -106,6 +106,4 @@ extern int cpu_has_xfeatures(u64 xfeatures_mask, const char **feature_name);
>   */
>  #define PASID_DISABLED	0
>  
> -static inline void update_pasid(void) { }
> -
>  #endif /* _ASM_X86_FPU_API_H */
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 0c228787704f..5b5d69b04fcc 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -505,21 +505,6 @@ int intel_svm_unbind_gpasid(struct device *dev, u32 pasid)
>  	return ret;
>  }
>  
> -static void _load_pasid(void *unused)
> -{
> -	update_pasid();
> -}
> -
> -static void load_pasid(struct mm_struct *mm, u32 pasid)
> -{
> -	mutex_lock(&mm->context.lock);
> -
> -	/* Update PASID MSR on all CPUs running the mm's tasks. */
> -	on_each_cpu_mask(mm_cpumask(mm), _load_pasid, NULL, true);
> -
> -	mutex_unlock(&mm->context.lock);
> -}
> -
>  static int intel_svm_alloc_pasid(struct device *dev, struct mm_struct *mm,
>  				 unsigned int flags)
>  {
> @@ -614,10 +599,6 @@ static struct iommu_sva *intel_svm_bind_mm(struct intel_iommu *iommu,
>  	if (ret)
>  		goto free_sdev;
>  
> -	/* The newly allocated pasid is loaded to the mm. */
> -	if (!(flags & SVM_FLAG_SUPERVISOR_MODE) && list_empty(&svm->devs))
> -		load_pasid(mm, svm->pasid);
> -
>  	list_add_rcu(&sdev->list, &svm->devs);
>  success:
>  	return &sdev->sva;
> @@ -670,11 +651,8 @@ static int intel_svm_unbind_mm(struct device *dev, u32 pasid)
>  			kfree_rcu(sdev, rcu);
>  
>  			if (list_empty(&svm->devs)) {
> -				if (svm->notifier.ops) {
> +				if (svm->notifier.ops)
>  					mmu_notifier_unregister(&svm->notifier, mm);
> -					/* Clear mm's pasid. */
> -					load_pasid(mm, PASID_DISABLED);
> -				}
>  				pasid_private_remove(svm->pasid);
>  				/* We mandate that no page faults may be outstanding
>  				 * for the PASID when intel_svm_unbind_mm() is called.
RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
+Robin.

> From: Jason Gunthorpe
> Sent: Thursday, September 23, 2021 8:22 PM
>
> On Thu, Sep 23, 2021 at 12:05:29PM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Thursday, September 23, 2021 7:27 PM
> > >
> > > On Thu, Sep 23, 2021 at 11:15:24AM +0100, Jean-Philippe Brucker wrote:
> > >
> > > > So we can only tell userspace "No_snoop is not supported" (provided we
> > > > even want to allow them to enable No_snoop). Users in control of stage-1
> > > > tables can create non-cacheable mappings through MAIR attributes.
> > >
> > > My point is that ARM is using IOMMU_CACHE to control the overall
> > > cachability of the DMA
> > >
> > > ie not specifying IOMMU_CACHE requires using the arch specific DMA
> > > cache flushers.
> > >
> > > Intel never uses arch specific DMA cache flushers, and instead is
> > > abusing IOMMU_CACHE to mean IOMMU_BLOCK_NO_SNOOP on DMA that
> > > is always cachable.
> >
> > it uses IOMMU_CACHE to force all DMAs to snoop, including those which
> > have the non_snoop flag and wouldn't snoop the cache if the iommu were
> > disabled. Nothing is blocked.
>
> I see it differently, on Intel the only way to bypass the cache with
> DMA is to specify the no-snoop bit in the TLP. The IOMMU PTE flag we
> are talking about tells the IOMMU to ignore the no snoop bit.
>
> Again, Intel arch in the kernel does not support the DMA cache flush
> arch API and *DOES NOT* support incoherent DMA at all.
>
> ARM *does* implement the DMA cache flush arch API and is using
> IOMMU_CACHE to control if the caller will, or will not call the cache
> flushes.

I still didn't fully understand this point after reading the code.
Looking at dma-iommu, its cache flush functions are all coded with the
below as the first check:

	if (dev_is_dma_coherent(dev) && !dev_is_untrusted(dev))
		return;

dev->dma_coherent is initialized from firmware info, not decided by
IOMMU_CACHE, i.e. it's not IOMMU_CACHE that decides whether cache
flushes should be called.

Probably the confusion comes from __iommu_dma_alloc_noncontiguous:

	if (!(ioprot & IOMMU_CACHE)) {
		struct scatterlist *sg;
		int i;

		for_each_sg(sgt->sgl, sg, sgt->orig_nents, i)
			arch_dma_prep_coherent(sg_page(sg), sg->length);
	}

Here it makes more sense for the condition to be "if (!coherent) {}".

With the above corrected, I think all iommu drivers do associate
IOMMU_CACHE with the snoop aspect:

Intel:
  - either force snooping by ignoring the snoop bit in the TLP (IOMMU_CACHE)
  - or have snoop decided by the TLP (!IOMMU_CACHE)

ARM:
  - set to snoop format if IOMMU_CACHE
  - set to non-snoop format if !IOMMU_CACHE
  (in both cases the TLP snoop bit is ignored?)

Other archs:
  - ignore IOMMU_CACHE, as the cache is always snooped via their IOMMUs

> This is fundamentally different from what Intel is using it for.
>
> > but why do you call it abuse? IOMMU_CACHE was first introduced for
> > Intel platform:
>
> IMHO ARM changed the meaning when Robin linked IOMMU_CACHE to
> dma_is_coherent stuff. At that point it became linked to 'do I need to
> call arch cache flushers or not'.

I didn't identify the exact commit for the above meaning change. Robin,
could you help share some thoughts here?

Thanks
Kevin
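Kevin's reading boils down to two independent decisions: whether the *CPU* must do cache maintenance (from firmware-described coherency plus the untrusted flag — the dma-iommu check quoted above), and whether the *IOMMU* forces snooping (IOMMU_CACHE). A tiny model of the first decision — names are illustrative, this mirrors the quoted check rather than being kernel API:

```c
#include <stdbool.h>
#include <assert.h>

/* Mirror of the dma-iommu entry check quoted above: CPU cache
 * maintenance is skipped only when the device is firmware-described
 * as coherent AND is not marked untrusted. Note that IOMMU_CACHE
 * plays no role in this decision. */
static bool needs_cpu_cache_sync(bool dma_coherent, bool untrusted)
{
	return !(dma_coherent && !untrusted);
}
```

The point of the model is what is absent: the IOMMU_CACHE prot bit does not appear, which is why Kevin argues IOMMU_CACHE is really about the snoop behaviour of the IOMMU page tables, not about whether arch cache flushers run.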
Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
On Fri, Sep 24, 2021 at 08:39:24AM -0700, Luck, Tony wrote:
> If you have ctags installed then a ctrl-] on that
> __fixup_pasid_exception() will take you to the function with
> the comments. No electron microscope needed.

I too use ctags, but when reading the #GP handler, this is a whole
different file. Also, I don't find any of those comments explaining the
not-our-#GP-but-harmless-cycle issue. The current->has_valid_pasid one
comes close, but just misses it.

But really the place to put this is in the #GP handler itself so we
don't have to dig through every call there to figure out how it's
supposed to work.

> +
> +/*
> + * Try to figure out if there is a PASID MSR value to propagate to the
> + * thread taking the #GP.
> + */
> +bool __fixup_pasid_exception(void)
> +{
> +	u32 pasid;
> +
> +	/*
> +	 * This function is called only when this #GP was triggered from user
> +	 * space. So the mm cannot be NULL.
> +	 */
> +	pasid = current->mm->pasid;
> +
> +	/* If no PASID is allocated, there is nothing to propagate. */
> +	if (pasid == PASID_DISABLED)
> +		return false;
> +
> +	/*
> +	 * If the current task already has a valid PASID MSR, then the #GP
> +	 * fault must be for some non-ENQCMD related reason.
> +	 */
> +	if (current->has_valid_pasid)
> +		return false;
> +
> +	/* Fix up the MSR by the PASID in the mm. */
> +	fpu__pasid_write(pasid);
> +	current->has_valid_pasid = 1;
> +
> +	return true;
> +}
>
> -Tony
Re: [RFC 04/20] iommu: Add iommu_device_get_info interface
Hi David,

On 2021/9/29 10:52, David Gibson wrote:
> On Sun, Sep 19, 2021 at 02:38:32PM +0800, Liu Yi L wrote:
> > From: Lu Baolu
> >
> > This provides an interface for upper layers to get the per-device
> > iommu attributes.
> >
> >     int iommu_device_get_info(struct device *dev,
> >                               enum iommu_devattr attr, void *data);
>
> The fact that this interface doesn't let you know how to size the
> data buffer, other than by just knowing the right size for each attr,
> concerns me.

We plan to address this by following the comments here:

https://lore.kernel.org/linux-iommu/20210921161930.gp327...@nvidia.com/

Best regards,
baolu
Re: [RFC 04/20] iommu: Add iommu_device_get_info interface
On 2021/9/29 17:25, Lu Baolu wrote:
> Hi David,
>
> On 2021/9/29 10:52, David Gibson wrote:
> > On Sun, Sep 19, 2021 at 02:38:32PM +0800, Liu Yi L wrote:
> > > From: Lu Baolu
> > >
> > > This provides an interface for upper layers to get the per-device
> > > iommu attributes.
> > >
> > >     int iommu_device_get_info(struct device *dev,
> > >                               enum iommu_devattr attr, void *data);
> >
> > The fact that this interface doesn't let you know how to size the
> > data buffer, other than by just knowing the right size for each attr,
> > concerns me.
>
> We plan to address this by following the comments here:
>
> https://lore.kernel.org/linux-iommu/20210921161930.gp327...@nvidia.com/

And Christoph gave another option as well:

https://lore.kernel.org/linux-iommu/20210922050746.ga12...@lst.de/

Best regards,
baolu
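For the buffer-sizing concern raised here, one well-known pattern is vfio's `argsz` convention: the caller passes the size it allocated in the struct header, and the kernel writes back the size it actually needed, copying only what fits. A plain-C sketch of that convention (struct and function names are illustrative, not the interface from this thread):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

struct info_hdr {
	uint32_t argsz;	/* in: bytes the caller allocated; out: bytes needed */
	uint32_t flags;	/* out: which optional fields/caps are valid */
};

/* "Kernel" side sketch: copy as much as fits, then always report the
 * full required size back in argsz. A caller that sees argsz larger
 * than its buffer reallocates and retries. */
static int fill_info(struct info_hdr *user, uint32_t user_sz,
		     const void *full, uint32_t full_sz)
{
	uint32_t copy = user_sz < full_sz ? user_sz : full_sz;

	if (user_sz < sizeof(struct info_hdr))
		return -1; /* must at least hold the header */

	memcpy(user, full, copy);
	user->argsz = full_sz; /* tell the caller how big it really is */
	return 0;
}
```

This avoids per-attribute "just know the size" knowledge on the user side, at the cost of a possible two-call dance.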
Re: [PATCH] [RFC] qcom_scm: hide Kconfig symbol
On Mon, Sep 27, 2021 at 05:22:13PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann
>
> Now that SCM can be a loadable module, we have to add another
> dependency to avoid link failures when ipa or adreno-gpu are
> built-in:
>
> aarch64-linux-ld: drivers/net/ipa/ipa_main.o: in function `ipa_probe':
> ipa_main.c:(.text+0xfc4): undefined reference to `qcom_scm_is_available'
>
> ld.lld: error: undefined symbol: qcom_scm_is_available
> >>> referenced by adreno_gpu.c
> >>>               gpu/drm/msm/adreno/adreno_gpu.o:(adreno_zap_shader_load) in
> >>>               archive drivers/built-in.a
>
> This can happen when CONFIG_ARCH_QCOM is disabled and we don't select
> QCOM_MDT_LOADER, but some other module selects QCOM_SCM. Ideally we'd
> use a similar dependency here to what we have for QCOM_RPROC_COMMON,
> but that causes dependency loops from other things selecting QCOM_SCM.
>
> This appears to be an endless problem, so try something different this
> time:
>
>  - CONFIG_QCOM_SCM becomes a hidden symbol that nothing 'depends on'
>    but that is simply selected by all of its users
>
>  - All the stubs in include/linux/qcom_scm.h can go away
>
>  - arm-smccc.h needs to provide a stub for __arm_smccc_smc() to
>    allow compile-testing QCOM_SCM on all architectures.
>
>  - To avoid a circular dependency chain involving RESET_CONTROLLER
>    and PINCTRL_SUNXI, change the 'depends on RESET_CONTROLLER' in
>    the latter one to 'select'.
>
> The last bit is rather annoying, as drivers should generally never
> 'select' another subsystem, and about half the users of the reset
> controller interface do this anyway.
>
> Nevertheless, this version seems to pass all my randconfig tests
> and is more robust than any of the prior versions.
>
> Comments?
>
> Signed-off-by: Arnd Bergmann
> ---
>  drivers/firmware/Kconfig                |  4 +-
>  drivers/gpu/drm/msm/Kconfig             |  4 +-
>  drivers/iommu/Kconfig                   |  2 +-
>  drivers/media/platform/Kconfig          |  2 +-
>  drivers/mmc/host/Kconfig                |  2 +-
>  drivers/net/ipa/Kconfig                 |  1 +
>  drivers/net/wireless/ath/ath10k/Kconfig |  2 +-
>  drivers/pinctrl/qcom/Kconfig            |  3 +-
>  drivers/pinctrl/sunxi/Kconfig           |  6 +--
>  include/linux/arm-smccc.h               | 10 
>  include/linux/qcom_scm.h                | 71 -
>  11 files changed, 23 insertions(+), 84 deletions(-)
>
> diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
> index 220a58cf0a44..f7dd82ef0b9c 100644
> --- a/drivers/firmware/Kconfig
> +++ b/drivers/firmware/Kconfig
> @@ -203,9 +203,7 @@ config INTEL_STRATIX10_RSU
>  	  Say Y here if you want Intel RSU support.
>  
>  config QCOM_SCM
> -	tristate "Qcom SCM driver"
> -	depends on ARM || ARM64
> -	depends on HAVE_ARM_SMCCC
> +	tristate
>  	select RESET_CONTROLLER
>  
>  config QCOM_SCM_DOWNLOAD_MODE_DEFAULT
> diff --git a/drivers/gpu/drm/msm/Kconfig b/drivers/gpu/drm/msm/Kconfig
> index e9c6af78b1d7..3ddf739a6f9b 100644
> --- a/drivers/gpu/drm/msm/Kconfig
> +++ b/drivers/gpu/drm/msm/Kconfig
> @@ -17,7 +17,7 @@ config DRM_MSM
>  	select DRM_SCHED
>  	select SHMEM
>  	select TMPFS
> -	select QCOM_SCM if ARCH_QCOM
> +	select QCOM_SCM
>  	select WANT_DEV_COREDUMP
>  	select SND_SOC_HDMI_CODEC if SND_SOC
>  	select SYNC_FILE
> @@ -55,7 +55,7 @@ config DRM_MSM_GPU_SUDO
>  
>  config DRM_MSM_HDMI_HDCP
>  	bool "Enable HDMI HDCP support in MSM DRM driver"
> -	depends on DRM_MSM && QCOM_SCM
> +	depends on DRM_MSM
>  	default y
>  	help
>  	  Choose this option to enable HDCP state machine
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 124c41adeca1..989c83acbfee 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -308,7 +308,7 @@ config APPLE_DART
>  config ARM_SMMU
>  	tristate "ARM Ltd. System MMU (SMMU) Support"
>  	depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64)
> -	depends on QCOM_SCM || !QCOM_SCM #if QCOM_SCM=m this can't be =y
> +	select QCOM_SCM
>  	select IOMMU_API
>  	select IOMMU_IO_PGTABLE_LPAE
>  	select ARM_DMA_USE_IOMMU if ARM

I don't want to get in the way of this patch because I'm also tired of
the randconfig failures caused by QCOM_SCM. However, ARM_SMMU is
applicable to a wide variety of (non-qcom) SoCs and so it seems a shame
to require the QCOM_SCM code to be included for all of those when it's
not strictly needed at all.

Will
Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote:
> I think the perfect and the good are a bit confused here. If we go for
> "good", then we have an mm owning a PASID for its entire lifetime. If
> we want "perfect", then we should actually do it right: teach the
> kernel to update an entire mm's PASID setting all at once. This isn't
> *that* hard -- it involves two things:
>
> 1. The context switch code needs to resync PASID. Unfortunately, this
> adds some overhead to every context switch, although a static_branch
> could minimize it for non-PASID users.
>
> 2. A change to an mm's PASID needs to send an IPI, but that IPI can't
> touch FPU state. So instead the IPI should use task_work_add() to
> make sure PASID gets resynced.

What do we need 1 for? Any PASID change can be achieved using 2, no?

Basically, call task_work_add() on all relevant tasks [1], then IPI
spray the current running of those and presto.

[1] it is nigh on impossible to find all tasks sharing an mm in any
sane way due to CLONE_MM && !CLONE_THREAD.
Re: [PATCH] [RFC] qcom_scm: hide Kconfig symbol
On Wed, Sep 29, 2021 at 11:51 AM Will Deacon wrote:
> On Mon, Sep 27, 2021 at 05:22:13PM +0200, Arnd Bergmann wrote:
> >
> > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> > index 124c41adeca1..989c83acbfee 100644
> > --- a/drivers/iommu/Kconfig
> > +++ b/drivers/iommu/Kconfig
> > @@ -308,7 +308,7 @@ config APPLE_DART
> >  config ARM_SMMU
> >  	tristate "ARM Ltd. System MMU (SMMU) Support"
> >  	depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64)
> > -	depends on QCOM_SCM || !QCOM_SCM #if QCOM_SCM=m this can't be =y
> > +	select QCOM_SCM
> >  	select IOMMU_API
> >  	select IOMMU_IO_PGTABLE_LPAE
> >  	select ARM_DMA_USE_IOMMU if ARM
>
> I don't want to get in the way of this patch because I'm also tired of
> the randconfig failures caused by QCOM_SCM. However, ARM_SMMU is
> applicable to a wide variety of (non-qcom) SoCs and so it seems a shame
> to require the QCOM_SCM code to be included for all of those when it's
> not strictly needed at all.

Good point, I agree that needs to be fixed. I think this additional
change should do the trick:

--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -308,7 +308,6 @@ config APPLE_DART
 config ARM_SMMU
 	tristate "ARM Ltd. System MMU (SMMU) Support"
 	depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64)
-	select QCOM_SCM
 	select IOMMU_API
 	select IOMMU_IO_PGTABLE_LPAE
 	select ARM_DMA_USE_IOMMU if ARM
@@ -438,7 +437,7 @@ config QCOM_IOMMU
 	# Note: iommu drivers cannot (yet?) be built as modules
 	bool "Qualcomm IOMMU Support"
 	depends on ARCH_QCOM || (COMPILE_TEST && !GENERIC_ATOMIC64)
-	depends on QCOM_SCM=y
+	select QCOM_SCM
 	select IOMMU_API
 	select IOMMU_IO_PGTABLE_LPAE
 	select ARM_DMA_USE_IOMMU

I'll see if that causes any problems for the randconfig builds.

Arnd
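The line Arnd originally removed, `depends on QCOM_SCM || !QCOM_SCM`, is a standard Kconfig idiom worth spelling out: for a tristate, `FOO || !FOO` evaluates to `m` when `FOO=m`, so the dependent symbol is capped at `m` and can never be built-in against a modular `FOO`. A minimal illustration with placeholder symbol names (not from the patch):

```kconfig
config HELPER
	tristate "Helper firmware driver"

config USER_A
	tristate "Driver that may optionally call into HELPER"
	# "HELPER || !HELPER" is y when HELPER is y or n, but only m when
	# HELPER=m -- so USER_A can never be y against a modular HELPER.
	depends on HELPER || !HELPER

config USER_B
	tristate "Driver that always needs HELPER"
	# 'select' instead forces HELPER on, which avoids the link failure
	# but can create the dependency loops discussed in this thread.
	select HELPER
```

The trade-off debated here is exactly between these two: `depends on` keeps symbols honest but spreads the awkward idiom everywhere, while `select` is simple but drags QCOM_SCM into builds (like non-Qualcomm ARM_SMMU configs) that don't need it.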
RE: [RFC 17/20] iommu/iommufd: Report iova range to userspace
> From: Jean-Philippe Brucker
> Sent: Wednesday, September 22, 2021 10:49 PM
>
> On Sun, Sep 19, 2021 at 02:38:45PM +0800, Liu Yi L wrote:
> > [HACK. will fix in v2]
> >
> > IOVA range is critical info for userspace to manage DMA for an I/O address
> > space. This patch reports the valid iova range info of a given device.
> >
> > Due to aforementioned hack, this info comes from the hacked vfio type1
> > driver. To follow the same format in vfio, we also introduce a cap chain
> > format in IOMMU_DEVICE_GET_INFO to carry the iova range info.
> [...]
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 49731be71213..f408ad3c8ade 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -68,6 +68,7 @@
> >   *	+---++
> >   *	...
> >   * @addr_width:	the address width of supported I/O address spaces.
> > + * @cap_offset:	Offset within info struct of first cap
> >   *
> >   * Availability: after device is bound to iommufd
> >   */
> > @@ -77,9 +78,11 @@ struct iommu_device_info {
> >  #define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */
> >  #define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page sizes */
> >  #define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /* addr_width field valid */
> > +#define IOMMU_DEVICE_INFO_CAPS	(1 << 3) /* info supports cap chain */
> >  	__u64 dev_cookie;
> >  	__u64 pgsize_bitmap;
> >  	__u32 addr_width;
> > +	__u32 cap_offset;
>
> We can also add vendor-specific page table and PASID table properties as
> capabilities, otherwise we'll need giant unions in the iommu_device_info
> struct. That made me wonder whether pgsize and addr_width should also be
> separate capabilities for consistency, but this way might be good enough.
> There won't be many more generic capabilities. I have "output address
> width"

what do you mean by "output address width"? Is it the output address of
stage-1 translation?

> and "PASID width", the rest is specific to Arm and SMMU table
> formats.

When coming to nested translation support, the stage-1 related info is
likely to be vendor-specific, and will be reported in the cap chain.

Regards,
Yi Liu

> Thanks,
> Jean
>
> >  };
> >
> >  #define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)
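The cap chain proposed here follows vfio's `struct vfio_info_cap_header` style: each capability starts with an id/version/next header, and `cap_offset` points at the first one within the info buffer. A userspace walk over such a chain might look like this (the struct is modeled on vfio's header, not a finalized iommufd uAPI):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of next cap from buffer start, 0 = end */
};

/* Find a capability by id in an info buffer whose first cap lives at
 * cap_offset (as the IOMMU_DEVICE_INFO_CAPS flag above would advertise).
 * Bounds-check each header against the buffer size. */
static const struct cap_header *find_cap(const void *info, size_t info_sz,
					 uint32_t cap_offset, uint16_t id)
{
	uint32_t off = cap_offset;

	while (off && off + sizeof(struct cap_header) <= info_sz) {
		const struct cap_header *hdr =
			(const void *)((const uint8_t *)info + off);

		if (hdr->id == id)
			return hdr;
		off = hdr->next;
	}
	return NULL;
}
```

The appeal of the chain, as Jean-Philippe notes, is that vendor-specific blobs (iova ranges, page table formats, PASID properties) each get their own variable-size capability instead of a giant union in the fixed struct.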
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: Jean-Philippe Brucker > Sent: Wednesday, September 22, 2021 9:45 PM > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > This patch adds IOASID allocation/free interface per iommufd. When > > allocating an IOASID, userspace is expected to specify the type and > > format information for the target I/O page table. > > > > This RFC supports only one type > (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > semantics. For this type the user should specify the addr_width of > > the I/O address space and whether the I/O page table is created in > > an iommu enfore_snoop format. enforce_snoop must be true at this > point, > > as the false setting requires additional contract with KVM on handling > > WBINVD emulation, which can be added later. > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > for what formats can be specified when allocating an IOASID. > > > > Open: > > - Devices on PPC platform currently use a different iommu driver in vfio. > > Per previous discussion they can also use vfio type1v2 as long as there > > is a way to claim a specific iova range from a system-wide address space. > > Is this the reason for passing addr_width to IOASID_ALLOC? I didn't get > what it's used for or why it's mandatory. But for PPC it sounds like it > should be an address range instead of an upper limit? yes, as this open described, it may need to be a range. But not sure if PPC requires multiple ranges or just one range. Perhaps, David may guide there. Regards, Yi Liu > Thanks, > Jean > > > This requirement doesn't sound PPC specific, as addr_width for pci > devices > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't > > adopted this design yet. We hope to have formal alignment in v1 > discussion > > and then decide how to incorporate it in v2. 
> > > > - Currently the ioasid term has already been used in the kernel > (drivers/iommu/ > > ioasid.c) to represent the hardware I/O address space ID in the wire. It > > covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub- > Stream > > ID). We need to find a way to resolve the naming conflict between the > hardware > > ID and software handle. One option is to rename the existing ioasid to be > > pasid or ssid, given their full names still sound generic. Appreciate more > > thoughts on this open!
Re: [RFC 17/20] iommu/iommufd: Report iova range to userspace
On Wed, Sep 29, 2021 at 10:44:01AM +, Liu, Yi L wrote: > > From: Jean-Philippe Brucker > > Sent: Wednesday, September 22, 2021 10:49 PM > > > > On Sun, Sep 19, 2021 at 02:38:45PM +0800, Liu Yi L wrote: > > > [HACK. will fix in v2] > > > > > > IOVA range is critical info for userspace to manage DMA for an I/O address > > > space. This patch reports the valid iova range info of a given device. > > > > > > Due to aforementioned hack, this info comes from the hacked vfio type1 > > > driver. To follow the same format in vfio, we also introduce a cap chain > > > format in IOMMU_DEVICE_GET_INFO to carry the iova range info. > > [...] > > > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h > > > index 49731be71213..f408ad3c8ade 100644 > > > --- a/include/uapi/linux/iommu.h > > > +++ b/include/uapi/linux/iommu.h > > > @@ -68,6 +68,7 @@ > > > * +---++ > > > * ... > > > * @addr_width:the address width of supported I/O address spaces. > > > + * @cap_offset: Offset within info struct of first cap > > > * > > > * Availability: after device is bound to iommufd > > > */ > > > @@ -77,9 +78,11 @@ struct iommu_device_info { > > > #define IOMMU_DEVICE_INFO_ENFORCE_SNOOP (1 << 0) /* IOMMU > > enforced snoop */ > > > #define IOMMU_DEVICE_INFO_PGSIZES(1 << 1) /* supported page > > sizes */ > > > #define IOMMU_DEVICE_INFO_ADDR_WIDTH (1 << 2) /* > > addr_wdith field valid */ > > > +#define IOMMU_DEVICE_INFO_CAPS (1 << 3) /* info > > supports cap chain */ > > > __u64 dev_cookie; > > > __u64 pgsize_bitmap; > > > __u32 addr_width; > > > + __u32 cap_offset; > > > > We can also add vendor-specific page table and PASID table properties as > > capabilities, otherwise we'll need giant unions in the iommu_device_info > > struct. That made me wonder whether pgsize and addr_width should also > > be > > separate capabilities for consistency, but this way might be good enough. > > There won't be many more generic capabilities. 
I have "output address > > width" > > what do you mean by "output address width"? Is it the output address > of stage-1 translation? Yes, so the guest knows the size of GPA it can write into the page table. For Arm SMMU the GPA size is determined by both the SMMU implementation and the host kernel configuration. But maybe that could also be vendor-specific, if other architectures don't need to communicate it. > > > and "PASID width", the rest is specific to Arm and SMMU table > > formats. > > When coming to nested translation support, the stage-1 related info are > likely to be vendor-specific, and will be reported in cap chain. Agreed Thanks, Jean
Re: [PATCH v8 09/12] media: mtk-vcodec: Get rid of mtk_smi_larb_get/put
On 29.09.21 03:37, Yong Wu wrote: MediaTek IOMMU has already added the device_link between the consumer and smi-larb device. If the vcodec device call the pm_runtime_get_sync, the smi-larb's pm_runtime_get_sync also be called automatically. CC: Tiffany Lin CC: Irui Wang Signed-off-by: Yong Wu Reviewed-by: Evan Green Acked-by: Tiffany Lin Reviewed-by: Dafna Hirschfeld --- .../platform/mtk-vcodec/mtk_vcodec_dec_pm.c | 37 +++- .../platform/mtk-vcodec/mtk_vcodec_drv.h | 3 -- .../platform/mtk-vcodec/mtk_vcodec_enc.c | 1 - .../platform/mtk-vcodec/mtk_vcodec_enc_pm.c | 44 +++ 4 files changed, 10 insertions(+), 75 deletions(-) diff --git a/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c b/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c index 6038db96f71c..d0bf9aa3b29d 100644 --- a/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c +++ b/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c @@ -8,14 +8,12 @@ #include #include #include -#include #include "mtk_vcodec_dec_pm.h" #include "mtk_vcodec_util.h" int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev) { - struct device_node *node; struct platform_device *pdev; struct mtk_vcodec_pm *pm; struct mtk_vcodec_clk *dec_clk; @@ -26,18 +24,7 @@ int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev) pm = &mtkdev->pm; pm->mtkdev = mtkdev; dec_clk = &pm->vdec_clk; - node = of_parse_phandle(pdev->dev.of_node, "mediatek,larb", 0); - if (!node) { - mtk_v4l2_err("of_parse_phandle mediatek,larb fail!"); - return -1; - } - pdev = of_find_device_by_node(node); - of_node_put(node); - if (WARN_ON(!pdev)) { - return -1; - } - pm->larbvdec = &pdev->dev; pdev = mtkdev->plat_dev; pm->dev = &pdev->dev; @@ -47,14 +34,11 @@ int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev) dec_clk->clk_info = devm_kcalloc(&pdev->dev, dec_clk->clk_num, sizeof(*clk_info), GFP_KERNEL); - if (!dec_clk->clk_info) { - ret = -ENOMEM; - goto put_device; - } + if (!dec_clk->clk_info) + return -ENOMEM; } else { mtk_v4l2_err("Failed to get vdec 
clock count"); - ret = -EINVAL; - goto put_device; + return -EINVAL; } for (i = 0; i < dec_clk->clk_num; i++) { @@ -63,29 +47,24 @@ int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev) "clock-names", i, &clk_info->clk_name); if (ret) { mtk_v4l2_err("Failed to get clock name id = %d", i); - goto put_device; + return ret; } clk_info->vcodec_clk = devm_clk_get(&pdev->dev, clk_info->clk_name); if (IS_ERR(clk_info->vcodec_clk)) { mtk_v4l2_err("devm_clk_get (%d)%s fail", i, clk_info->clk_name); - ret = PTR_ERR(clk_info->vcodec_clk); - goto put_device; + return PTR_ERR(clk_info->vcodec_clk); } } pm_runtime_enable(&pdev->dev); return 0; -put_device: - put_device(pm->larbvdec); - return ret; } void mtk_vcodec_release_dec_pm(struct mtk_vcodec_dev *dev) { pm_runtime_disable(dev->pm.dev); - put_device(dev->pm.larbvdec); } Now that functions only do 'pm_runtime_disable(dev->pm.dev);' so it will be more readable to remove the function mtk_vcodec_release_dec_pm and replace with pm_runtime_disable(dev->pm.dev); Same for the 'enc' equivalent. Thanks, Dafna int mtk_vcodec_dec_pw_on(struct mtk_vcodec_pm *pm) @@ -122,11 +101,6 @@ void mtk_vcodec_dec_clock_on(struct mtk_vcodec_pm *pm) } } - ret = mtk_smi_larb_get(pm->larbvdec); - if (ret) { - mtk_v4l2_err("mtk_smi_larb_get larbvdec fail %d", ret); - goto error; - }> return; error: @@ -139,7 +113,6 @@ void mtk_vcodec_dec_clock_off(struct mtk_vcodec_pm *pm) struct mtk_vcodec_clk *dec_clk = &pm->vdec_clk; int i = 0; - mtk_smi_larb_put(pm->larbvdec); for (i = dec_clk->clk_num - 1; i >= 0; i--) clk_disable_unprepare(dec_clk->clk_info[i].vcodec_clk); } diff --git a/drivers/media/platform/mtk-vcodec/mtk_vcodec_drv.h b/drivers/media/platform/mtk-vcodec/mtk_vcodec_drv.h index c6c7672fecfb..64b73dd880ce 100644 --- a/drivers/media/platform/mtk-vcodec/mtk_vcodec_drv.h +++ b/drivers/media/platform/mtk-vcodec/mtk_vcodec_drv.h @@ -189,10 +189,7 @@ struct mtk_vcodec_clk { */ struct mtk_vcodec_pm { struc
Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
On Wed, Sep 29, 2021 at 09:08:25AM +0200, Cornelia Huck wrote: > On Wed, Sep 29 2021, "Tian, Kevin" wrote: > > >> From: David Gibson > >> Sent: Wednesday, September 29, 2021 10:44 AM > >> > >> > One alternative option is to arrange device nodes in sub-directories > >> > based > >> > on the device type. But doing so also adds one trouble to userspace. The > >> > current vfio uAPI is designed to have the user query device type via > >> > VFIO_DEVICE_GET_INFO after opening the device. With this option the user > >> > instead needs to figure out the device type before opening the device, to > >> > identify the sub-directory. > >> > >> Wouldn't this be up to the operator / configuration, rather than the > >> actual software though? I would assume that typically the VFIO > >> program would be pointed at a specific vfio device node file to use, > >> e.g. > >> my-vfio-prog -d /dev/vfio/pci/:0a:03.1 > >> > >> Or more generally, if you're expecting userspace to know a name in a > >> unique pattern, they can equally well know a "type/name" pair. > >> > > > > You are correct. Currently: > > > > -device, vfio-pci,host=:BB:DD.F > > -device, vfio-pci,sysfdev=/sys/bus/pci/devices/ :BB:DD.F > > -device, vfio-platform,sysdev=/sys/bus/platform/devices/PNP0103:00 > > > > above is definitely type/name information to find the related node. > > > > Actually even for Jason's proposal we still need such information to > > identify the sysfs path. > > > > Then I feel type-based sub-directory does work. Adding another link > > to sysfs sounds unnecessary now. But I'm not sure whether we still > > want to create /dev/vfio/devices/vfio0 thing and related udev rule > > thing that you pointed out in another mail. > > Still reading through this whole thread, but type-based subdirectories > also make the most sense to me.
I don't really see userspace wanting to > grab just any device and then figure out whether it is the device it was > looking for, but rather immediately go to a specific device or at least > a device of a specific type. Even so the kernel should not be creating this, that is a job for udev and some symlinks > Sequentially-numbered devices tend to become really unwieldy in my > experience if you are working on a system with loads of devices. If the user experience is always to refer to the sysfs node as Kevin shows above then the user never sees the integer. It is very much like how the group number works already: programs always start at the sysfs, do the readlink thing on iommu_group and then get the group number to go to /dev/vfio/X. So it is already the case that every piece of software can construct a sysfs path to the device, we are just changing from readlink(iommu_group) to readdir(vfio/vfio_device_XX) Jason
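Both lookups Jason describes - today's group-based flow (readlink on the device's iommu_group) and the proposed device-based flow (readdir of the device's vfio/ sysfs directory) - fit in a few lines of shell. Paths and helper names below are illustrative only:

```shell
# Today's flow: sysfs device dir -> iommu group number -> /dev/vfio/X
vfio_group_node() {
    sysfs_dev="$1"          # e.g. /sys/bus/pci/devices/0000:0a:03.1
    group=$(basename "$(readlink "$sysfs_dev/iommu_group")")
    echo "/dev/vfio/$group"
}

# Proposed flow: sysfs device dir -> vfio cdev name -> /dev/vfio/devices/vfioX
vfio_cdev_node() {
    sysfs_dev="$1"
    name=$(ls "$sysfs_dev/vfio" | head -n1)   # e.g. vfio0
    echo "/dev/vfio/devices/$name"
}
```

Either way the program's input stays the sysfs path; only the final hop from sysfs to the char device changes.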
Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
On Wed, Sep 29, 2021 at 12:46:14PM +1000, da...@gibson.dropbear.id.au wrote: > On Tue, Sep 21, 2021 at 10:00:14PM -0300, Jason Gunthorpe wrote: > > On Wed, Sep 22, 2021 at 12:54:02AM +, Tian, Kevin wrote: > > > > From: Jason Gunthorpe > > > > Sent: Wednesday, September 22, 2021 12:01 AM > > > > > > > > > One open about how to organize the device nodes under > > > > /dev/vfio/devices/. > > > > > This RFC adopts a simple policy by keeping a flat layout with mixed > > > > devname > > > > > from all kinds of devices. The prerequisite of this model is that > > > > > devnames > > > > > from different bus types are unique formats: > > > > > > > > This isn't reliable, the devname should just be vfio0, vfio1, etc > > > > > > > > The userspace can learn the correct major/minor by inspecting the > > > > sysfs. > > > > > > > > This whole concept should disappear into the prior patch that adds the > > > > struct device in the first place, and I think most of the code here > > > > can be deleted once the struct device is used properly. > > > > > > > > > > Can you help elaborate above flow? This is one area where we need > > > more guidance. > > > > > > When Qemu accepts an option "-device vfio-pci,host=:BB:DD.F", > > > how does Qemu identify which vifo0/1/... is associated with the specified > > > :BB:DD.F? > > > > When done properly in the kernel the file: > > > > /sys/bus/pci/devices/:BB:DD.F/vfio/vfioX/dev > > > > Will contain the major:minor of the VFIO device. > > > > Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat > > that the major:minor matches. > > > > in the above pattern "pci" and ":BB:DD.FF" are the arguments passed > > to qemu. > > I thought part of the appeal of the device centric model was less > grovelling around in sysfs for information. Using type/address > directly in /dev seems simpler than having to dig around matching > things here. I would say more regular grovelling. 
Starting from a sysfs device directory and querying the VFIO cdev associated with it is much more normal than what happens today, which also includes passing sysfs information into an ioctl :\ > Note that this doesn't have to be done in kernel: you could have the > kernel just call them /dev/vfio/devices/vfio0, ... but add udev rules > that create symlinks from say /dev/vfio/pci/:BB:SS.F -> > ../devices/vfioXX based on the sysfs information. This is the right approach if people want to do this, but I'm not sure it is worth it given backwards compat requires the sysfs path as input. We may as well stick with sysfs as the command line interface for userspace tools. And I certainly don't want to see userspace tools trying to reverse a sysfs path into a /dev/ symlink name when they can directly and reliably learn the correct cdev from the sysfs path. Jason
Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()
On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote: > > +struct iommufd_device { > > + unsigned int id; > > + struct iommufd_ctx *ictx; > > + struct device *dev; /* always be the physical device */ > > + u64 dev_cookie; > > Why do you need both an 'id' and a 'dev_cookie'? Since they're both > unique, couldn't you just use the cookie directly as the index into > the xarray? ID is the kernel value in the xarray - the xarray is much more efficient & safe with small kernel-controlled values. dev_cookie is a user-assigned value that may not be unique. Its purpose is to allow userspace to receive an event and go back to its own structure. Most likely userspace will store a pointer here, but it is also possible that userspace doesn't use it at all. It is a pretty normal pattern. Jason
Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
On Wed, Sep 29, 2021 at 06:41:00AM +, Tian, Kevin wrote: > > From: David Gibson > > Sent: Wednesday, September 29, 2021 2:01 PM > > > > On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote: > > > This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the > > vfio > > > device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is > > provided > > > because it's implicitly done when the device fd is closed. > > > > > > In concept a vfio device can be bound to multiple iommufds, each hosting > > > a subset of I/O address spaces attached by this device. > > > > I really feel like this many<->many mapping between devices is going > > to be super-confusing, and therefore make it really hard to be > > confident we have all the rules right for proper isolation. > > Based on new discussion on group ownership part (patch06), I feel this > many<->many relationship will disappear. The context fd (either container > or iommufd) will uniquely mark the ownership on a physical device and > its group. With this design it's impractical to have one device bound > to multiple iommufds. That should be a requirement! We have no way to prove that two iommufds are the same security domain, so devices/groups cannot be shared. That is why the API I suggested takes in a struct file to ID the user security context. A group is accessible only from that single struct file and no more. If the first series goes the way I outlined then I think David's concern about security is strongly solved as the IOMMU layer is directly managing it with a very clear responsibility and semantics. Jason
Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
On Wed, Sep 29 2021 at 11:54, Peter Zijlstra wrote: > On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote: >> I think the perfect and the good are a bit confused here. If we go for >> "good", then we have an mm owning a PASID for its entire lifetime. If >> we want "perfect", then we should actually do it right: teach the >> kernel to update an entire mm's PASID setting all at once. This isn't >> *that* hard -- it involves two things: >> >> 1. The context switch code needs to resync PASID. Unfortunately, this >> adds some overhead to every context switch, although a static_branch >> could minimize it for non-PASID users. > >> 2. A change to an mm's PASID needs to send an IPI, but that IPI can't >> touch FPU state. So instead the IPI should use task_work_add() to >> make sure PASID gets resynced. > > What do we need 1 for? Any PASID change can be achieved using 2 no? > > Basically, call task_work_add() on all relevant tasks [1], then IPI > spray the current running of those and presto. > > [1] it is nigh on impossible to find all tasks sharing an mm in any sane > way due to CLONE_MM && !CLONE_THREAD. Why would we want any of that at all?

    Process starts, no PASID assigned.

    bind to device -> PASID is allocated and assigned to the mm

    some task of the process issues ENQCMD -> #GP -> write PASID MSR

After that the PASID is saved and restored as part of the XSTATE and there is no extra overhead in context switch or return to user space. All tasks of the process which never used ENQCMD don't care, and their PASID xstate is in the init state. There is absolutely no point in enforcing that all tasks of the process have the PASID activated immediately when it is assigned. If they need it they get it via the #GP fixup and everything just works. Looking at that patch again, none of this muck in fpu__pasid_write() is required at all.
The whole exception fixup is:

    if (!user_mode(regs))
        return false;

    if (!current->mm->pasid)
        return false;

    if (current->pasid_activated)
        return false;

    wrmsrl(MSR_IA32_PASID, current->mm->pasid);
    current->pasid_activated = true;
    return true;

There is zero requirement to look at TIF_NEED_FPU_LOAD or fpregs_state_valid() simply because the #GP comes straight from user space, which means the FPU registers contain the current task's user space state. If TIF_NEED_FPU_LOAD were set or fpregs_state_valid() were false after the user_mode() check, then this would simply be a bug somewhere else and would have nothing to do with this PASID fixup. So no need for magic update_one_xstate_feature() wrappers, no concurrency concerns, nothing. It's that simple, really. Anything more complex is just a purely academic exercise which creates more problems than it solves. Thanks, tglx
Re: [RFC 17/20] iommu/iommufd: Report iova range to userspace
On Wed, Sep 29, 2021 at 01:07:56PM +0100, Jean-Philippe Brucker wrote: > Yes, so the guest knows the size of GPA it can write into the page table. > For Arm SMMU the GPA size is determined by both the SMMU implementation > and the host kernel configuration. But maybe that could also be > vendor-specific, if other architectures don't need to communicate it. I think there should be a dedicated query to return HW-specific parameters for a user page table format. Somehow I think there will be a lot of these. So 'user page table format arm smmu v1' can be queried to return its own unique struct that has everything needed to operate that format of page table. Userspace already needs to know how to form that specific HW's PTEs, so processing a HW-specific query is not a problem. Jason
Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
On Wed, Sep 29, 2021 at 08:48:28AM +, Tian, Kevin wrote: > ARM: > - set to snoop format if IOMMU_CACHE > - set to nonsnoop format if !IOMMU_CACHE > (in both cases TLP snoop bit is ignored?) Where do you see this? I couldn't even find this functionality in the ARM HW manual?? What I saw is ARM linking IOMMU_CACHE to an IO PTE bit that causes the cache coherence to be disabled, which is not ignoring no-snoop. > I didn't identify the exact commit for above meaning change. > > Robin, could you help share some thoughts here? It is this:

    static int dma_info_to_prot(enum dma_data_direction dir, bool coherent,
                                unsigned long attrs)
    {
        int prot = coherent ? IOMMU_CACHE : 0;

Which sets IOMMU_CACHE based on:

    static void *iommu_dma_alloc(struct device *dev, size_t size,
                                 dma_addr_t *handle, gfp_t gfp, unsigned long attrs)
    {
        bool coherent = dev_is_dma_coherent(dev);
        int ioprot = dma_info_to_prot(DMA_BIDIRECTIONAL, coherent, attrs);

Driving IOMMU_CACHE from dev_is_dma_coherent() has *NOTHING* to do with no-snoop TLPs and everything to do with the arch cache maintenance API. Jason
Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
On Wed, Sep 29, 2021 at 04:35:19PM +1000, David Gibson wrote: > Yes, exactly. And with a group interface it's obvious it has to > understand it. With the non-group interface, you can get to this > stage in ignorance of groups. It will even work as long as you are > lucky enough only to try with singleton-group devices. Then you try > it with two devices in the one group and doing (3) on device A will > implicitly change the DMA environment of device B. The security model here says this is fine. This idea to put the iommu code in charge of security is quite clean; as I said in the other mail, drivers attached to a 'struct device *' tell the iommu layer what they are doing:

    iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
    iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
    iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)

And it decides if it is allowed. If device A is allowed to go to userspace then security-wise it is deemed fine that B is impacted. That is what we have defined already today. This proposal does not free userspace from having to understand this! The iommu_group sysfs is still there and still must be understood. The *admin* is the one responsible for understanding the groups, not the applications. The admin has no idea what a group FD is - they should be looking at the sysfs and seeing the iommu_group directories. Jason
Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
On Wed, Sep 29, 2021 at 12:38:35AM +, Tian, Kevin wrote: > /* If set the driver must call iommu_XX as the first action in probe() or > * before it attempts to do DMA > */ > bool suppress_dma_owner:1; It is not "attempts to do DMA" but more "operates the physical device in any way". Not having ownership means another entity could be using user space DMA to manipulate the device state and attack the integrity of the kernel's programming of the device. Jason
Re: [PATCH] [RFC] qcom_scm: hide Kconfig symbol
On Wed 29 Sep 05:04 CDT 2021, Arnd Bergmann wrote: > On Wed, Sep 29, 2021 at 11:51 AM Will Deacon wrote: > > On Mon, Sep 27, 2021 at 05:22:13PM +0200, Arnd Bergmann wrote: > > > > > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > > > index 124c41adeca1..989c83acbfee 100644 > > > --- a/drivers/iommu/Kconfig > > > +++ b/drivers/iommu/Kconfig > > > @@ -308,7 +308,7 @@ config APPLE_DART > > > config ARM_SMMU > > > tristate "ARM Ltd. System MMU (SMMU) Support" > > > depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64) > > > - depends on QCOM_SCM || !QCOM_SCM #if QCOM_SCM=m this can't be =y > > > + select QCOM_SCM > > > select IOMMU_API > > > select IOMMU_IO_PGTABLE_LPAE > > > select ARM_DMA_USE_IOMMU if ARM > > > > I don't want to get in the way of this patch because I'm also tired of the > > randconfig failures caused by QCOM_SCM. However, ARM_SMMU is applicable to > > a wide variety of (non-qcom) SoCs and so it seems a shame to require the > > QCOM_SCM code to be included for all of those when it's not strictly needed > > at all. > > Good point, I agree that needs to be fixed. I think this additional > change should do the trick: > ARM_SMMU and QCOM_IOMMU are two separate implementations and both uses QCOM_SCM. So both of them should select QCOM_SCM. "Unfortunately" the Qualcomm portion of ARM_SMMU is builtin unconditionally, so going with something like select QCOM_SCM if ARCH_QCOM would still require the stubs in qcom_scm.h. Regards, Bjorn > --- a/drivers/iommu/Kconfig > +++ b/drivers/iommu/Kconfig > @@ -308,7 +308,6 @@ config APPLE_DART > config ARM_SMMU > tristate "ARM Ltd. System MMU (SMMU) Support" > depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64) > - select QCOM_SCM > select IOMMU_API > select IOMMU_IO_PGTABLE_LPAE > select ARM_DMA_USE_IOMMU if ARM > @@ -438,7 +437,7 @@ config QCOM_IOMMU > # Note: iommu drivers cannot (yet?) 
be built as modules > bool "Qualcomm IOMMU Support" > depends on ARCH_QCOM || (COMPILE_TEST && !GENERIC_ATOMIC64) > - depends on QCOM_SCM=y > + select QCOM_SCM > select IOMMU_API > select IOMMU_IO_PGTABLE_LPAE > select ARM_DMA_USE_IOMMU > > I'll see if that causes any problems for the randconfig builds. > >Arnd ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/2] [v2] qcom_scm: hide Kconfig symbol
On Tue 28 Sep 02:50 CDT 2021, Arnd Bergmann wrote: > From: Arnd Bergmann > > Now that SCM can be a loadable module, we have to add another > dependency to avoid link failures when ipa or adreno-gpu are > built-in: > > aarch64-linux-ld: drivers/net/ipa/ipa_main.o: in function `ipa_probe': > ipa_main.c:(.text+0xfc4): undefined reference to `qcom_scm_is_available' > > ld.lld: error: undefined symbol: qcom_scm_is_available > >>> referenced by adreno_gpu.c > >>> gpu/drm/msm/adreno/adreno_gpu.o:(adreno_zap_shader_load) in > >>> archive drivers/built-in.a > > This can happen when CONFIG_ARCH_QCOM is disabled and we don't select > QCOM_MDT_LOADER, but some other module selects QCOM_SCM. Ideally we'd > use a similar dependency here to what we have for QCOM_RPROC_COMMON, > but that causes dependency loops from other things selecting QCOM_SCM. > > This appears to be an endless problem, so try something different this > time: > > - CONFIG_QCOM_SCM becomes a hidden symbol that nothing 'depends on' >but that is simply selected by all of its users > > - All the stubs in include/linux/qcom_scm.h can go away > > - arm-smccc.h needs to provide a stub for __arm_smccc_smc() to >allow compile-testing QCOM_SCM on all architectures. > > - To avoid a circular dependency chain involving RESET_CONTROLLER >and PINCTRL_SUNXI, drop the 'select RESET_CONTROLLER' statement. >According to my testing this still builds fine, and the QCOM >platform selects this symbol already. 
> > Acked-by: Kalle Valo > Signed-off-by: Arnd Bergmann > --- > Changes in v2: > - drop the 'select RESET_CONTROLLER' line, rather than adding > more of the same > --- > drivers/firmware/Kconfig| 5 +- > drivers/gpu/drm/msm/Kconfig | 4 +- > drivers/iommu/Kconfig | 2 +- > drivers/media/platform/Kconfig | 2 +- > drivers/mmc/host/Kconfig| 2 +- > drivers/net/ipa/Kconfig | 1 + > drivers/net/wireless/ath/ath10k/Kconfig | 2 +- > drivers/pinctrl/qcom/Kconfig| 3 +- > include/linux/arm-smccc.h | 10 > include/linux/qcom_scm.h| 71 - > 10 files changed, 20 insertions(+), 82 deletions(-) > > diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig > index 220a58cf0a44..cda7d7162cbb 100644 > --- a/drivers/firmware/Kconfig > +++ b/drivers/firmware/Kconfig > @@ -203,10 +203,7 @@ config INTEL_STRATIX10_RSU > Say Y here if you want Intel RSU support. > > config QCOM_SCM > - tristate "Qcom SCM driver" > - depends on ARM || ARM64 > - depends on HAVE_ARM_SMCCC > - select RESET_CONTROLLER > + tristate > > config QCOM_SCM_DOWNLOAD_MODE_DEFAULT > bool "Qualcomm download mode enabled by default" > diff --git a/drivers/gpu/drm/msm/Kconfig b/drivers/gpu/drm/msm/Kconfig > index e9c6af78b1d7..3ddf739a6f9b 100644 > --- a/drivers/gpu/drm/msm/Kconfig > +++ b/drivers/gpu/drm/msm/Kconfig > @@ -17,7 +17,7 @@ config DRM_MSM > select DRM_SCHED > select SHMEM > select TMPFS > - select QCOM_SCM if ARCH_QCOM > + select QCOM_SCM > select WANT_DEV_COREDUMP > select SND_SOC_HDMI_CODEC if SND_SOC > select SYNC_FILE > @@ -55,7 +55,7 @@ config DRM_MSM_GPU_SUDO > > config DRM_MSM_HDMI_HDCP > bool "Enable HDMI HDCP support in MSM DRM driver" > - depends on DRM_MSM && QCOM_SCM > + depends on DRM_MSM > default y > help > Choose this option to enable HDCP state machine > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > index 124c41adeca1..989c83acbfee 100644 > --- a/drivers/iommu/Kconfig > +++ b/drivers/iommu/Kconfig > @@ -308,7 +308,7 @@ config APPLE_DART > config ARM_SMMU > tristate "ARM Ltd. 
System MMU (SMMU) Support" > depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64) > - depends on QCOM_SCM || !QCOM_SCM #if QCOM_SCM=m this can't be =y > + select QCOM_SCM > select IOMMU_API > select IOMMU_IO_PGTABLE_LPAE > select ARM_DMA_USE_IOMMU if ARM As noted in the RFC, I think you also need to fix QCOM_IOMMU. In particular (iirc) since all the users of the iommu might be modules, which would prevent QCOM_IOMMU from being selected. The rest looks good. Regards, Bjorn > diff --git a/drivers/media/platform/Kconfig b/drivers/media/platform/Kconfig > index 157c924686e4..80321e03809a 100644 > --- a/drivers/media/platform/Kconfig > +++ b/drivers/media/platform/Kconfig > @@ -565,7 +565,7 @@ config VIDEO_QCOM_VENUS > depends on VIDEO_DEV && VIDEO_V4L2 && QCOM_SMEM > depends on (ARCH_QCOM && IOMMU_DMA) || COMPILE_TEST > select QCOM_MDT_LOADER if ARCH_QCOM > - select QCOM_SCM if ARCH_QCOM > + select QCOM_SCM > select VIDEOBUF2_DMA_CONTIG > select V4L2_MEM2MEM_DEV > help > diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig > index 71313961cc54..95b3511b05
Re: [PATCH v8 03/12] iommu/mediatek: Add probe_defer for smi-larb
On 29.09.21 03:37, Yong Wu wrote:

Prepare for adding device_link. The iommu consumer should use device_link to connect with the smi-larb (supplier), so the smi-larb should run before the iommu consumer. Here we delay the iommu driver until the smi driver is ready, so all the iommu consumers always come after the smi driver. Without this patch, if some consumer drivers run before smi-larb, the supplier link_status is DL_DEV_NO_DRIVER(0) in device_link_add, and then device_links_driver_bound will use WARN_ON to complain that the link_status of the supplier is not right. device_is_bound may be more elegant here, but it is not allowed to EXPORT per https://lore.kernel.org/patchwork/patch/1334670/.

Signed-off-by: Yong Wu
Tested-by: Frank Wunderlich # BPI-R2/MT7623
---
 drivers/iommu/mtk_iommu.c    | 2 +-
 drivers/iommu/mtk_iommu_v1.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index d837adfd1da5..d5848f78a677 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -844,7 +844,7 @@ static int mtk_iommu_probe(struct platform_device *pdev)
 		id = i;
 		plarbdev = of_find_device_by_node(larbnode);
-		if (!plarbdev) {
+		if (!plarbdev || !plarbdev->dev.driver) {
 			of_node_put(larbnode);
 			return -EPROBE_DEFER;

if plarbdev is null, doesn't that mean that the device does not exist? So should we return -ENODEV in that case?

thanks,
Dafna

 		}

diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 1467ba1e4417..4d7809432239 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -602,7 +602,7 @@ static int mtk_iommu_probe(struct platform_device *pdev)
 	}
 	plarbdev = of_find_device_by_node(larbnode);
-	if (!plarbdev) {
+	if (!plarbdev || !plarbdev->dev.driver) {
 		of_node_put(larbnode);
 		return -EPROBE_DEFER;
 	}
RE: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
> There is zero requirement to look at TIF_NEED_FPU_LOAD or > fpregs_state_valid() simply because the #GP comes straight from user > space which means the FPU registers contain the current tasks user space > state. Just to double confirm ... there is no point in the #GP handler up to this point where pre-emption can occur? -Tony
Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
On 9/28/21 16:10, Luck, Tony wrote: Moving beyond pseudo-code and into compiles-but-probably-broken-code. The intent of the functions below is that Fenghua should be able to do: void fpu__pasid_write(u32 pasid) { u64 msr_val = pasid | MSR_IA32_PASID_VALID; struct ia32_pasid_state *addr; addr = begin_update_one_xsave_feature(current, XFEATURE_PASID, true); addr->pasid = msr_val; finish_update_one_xsave_feature(current); } This gets gnarly because we would presumably like to optimize the case where we can do the update directly in registers. I wonder if we can do it with a bit of macro magic in a somewhat generic way: typedef fpu__pasid_type u32; static inline void fpu__set_pasid_in_register(const u32 *value) { wrmsr(...); } #define DEFINE_FPU_HELPER(name) \ static inline void fpu__set_##name(const fpu__##name##_type *val) \ { \ fpregs_lock(); \ if (should write in memory) { \ ->xfeatures |= XFEATURE_##name; \ ptr = get_xsave_addr(...); \ memcpy(ptr, val, sizeof(*val)); \ __fpu_invalidate_fpregs_state(...); \ } else { \ fpu__set_##name##_in_register(val); \ } \ } ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
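A compilable userspace sketch of the macro pattern above, assuming the "should write in memory" test maps to something like TIF_NEED_FPU_LOAD. All names here (fpu_set_pasid, fake_xsave_area, state_in_memory, ...) are illustrative stand-ins, not the real FPU API:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simulated state: is the feature live in registers or saved in memory? */
static int state_in_memory;		/* stands in for TIF_NEED_FPU_LOAD */
static uint64_t fake_register;		/* stands in for the live MSR/FPU reg */
static uint64_t fake_xsave_area;	/* stands in for the task xsave buffer */

typedef uint64_t fpu_pasid_type;	/* typedef is type-first, alias-second */

static void fpu_set_pasid_in_register(const fpu_pasid_type *val)
{
	fake_register = *val;		/* would be a wrmsr() in the kernel */
}

/*
 * The token-pasting pattern from the thread: one generated setter per
 * xfeature that either patches the saved image in memory or pokes the
 * live register, mirroring the two branches in Andy's pseudo-macro.
 */
#define DEFINE_FPU_HELPER(name)						\
static void fpu_set_##name(const fpu_##name##_type *val)		\
{									\
	if (state_in_memory)						\
		memcpy(&fake_xsave_area, val, sizeof(*val));		\
	else								\
		fpu_set_##name##_in_register(val);			\
}

DEFINE_FPU_HELPER(pasid)
```

The real kernel version would additionally need fpregs_lock(), the xfeatures bit update, and __fpu_invalidate_fpregs_state(), which have no userspace analogue here.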
Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
On 9/29/21 05:28, Thomas Gleixner wrote: On Wed, Sep 29 2021 at 11:54, Peter Zijlstra wrote: On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote: I think the perfect and the good are a bit confused here. If we go for "good", then we have an mm owning a PASID for its entire lifetime. If we want "perfect", then we should actually do it right: teach the kernel to update an entire mm's PASID setting all at once. This isn't *that* hard -- it involves two things: 1. The context switch code needs to resync PASID. Unfortunately, this adds some overhead to every context switch, although a static_branch could minimize it for non-PASID users. 2. A change to an mm's PASID needs to sent an IPI, but that IPI can't touch FPU state. So instead the IPI should use task_work_add() to make sure PASID gets resynced. What do we need 1 for? Any PASID change can be achieved using 2 no? Basically, call task_work_add() on all relevant tasks [1], then IPI spray the current running of those and presto. [1] it is nigh on impossible to find all tasks sharing an mm in any sane way due to CLONE_MM && !CLONE_THREAD. Why would we want any of that at all? Process starts, no PASID assigned. bind to device -> PASID is allocated and assigned to the mm some task of the process issues ENQCMD -> #GP -> write PASID MSR After that the PASID is saved and restored as part of the XSTATE and there is no extra overhead in context switch or return to user space. All tasks of the process which did never use ENQCMD don't care and their PASID xstate is in init state. There is absolutely no point in enforcing that all tasks of the process have the PASID activated immediately when it is assigned. If they need it they get it via the #GP fixup and everything just works. Looking at that patch again, none of this muck in fpu__pasid_write() is required at all. 
The whole exception fixup is: if (!user_mode(regs)) return false; if (!current->mm->pasid) return false; if (current->pasid_activated) return false; <-- preemption or BH here: kaboom. wrmsrl(MSR_IA32_PASID, current->mm->pasid); This needs the actual sane fpstate writing helper -- see other email.
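The three early-return checks can be modeled as a pure predicate. A minimal sketch with hypothetical flattened inputs (gp_ctx is not a kernel type; in the kernel these come from pt_regs and current):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flattened inputs standing in for regs/current state. */
struct gp_ctx {
	bool user_mode;		/* the #GP came straight from user space */
	unsigned int mm_pasid;	/* 0 = no PASID allocated to this mm */
	bool pasid_activated;	/* this task already wrote the MSR once */
};

/*
 * The decision logic Thomas sketches: only a user-mode #GP, in an mm
 * that owns a PASID, by a task that has not yet activated it, gets
 * fixed up (the real kernel would wrmsrl(MSR_IA32_PASID, ...) here,
 * before interrupts are enabled, per the preemption caveat above).
 */
static bool should_fixup_pasid(const struct gp_ctx *c)
{
	if (!c->user_mode)
		return false;
	if (!c->mm_pasid)
		return false;
	if (c->pasid_activated)
		return false;
	return true;
}
```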
Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
On Wed, Sep 29, 2021 at 09:58:22AM -0700, Andy Lutomirski wrote: > On 9/28/21 16:10, Luck, Tony wrote: > > Moving beyond pseudo-code and into compiles-but-probably-broken-code. > > > > > > The intent of the functions below is that Fenghua should be able to > > do: > > > > void fpu__pasid_write(u32 pasid) > > { > > u64 msr_val = pasid | MSR_IA32_PASID_VALID; > > struct ia32_pasid_state *addr; > > > > addr = begin_update_one_xsave_feature(current, XFEATURE_PASID, true); > > addr->pasid = msr_val; > > finish_update_one_xsave_feature(current); > > } > > > > This gets gnarly because we would presumably like to optimize the case where > we can do the update directly in registers. I wonder if we can do it with a > bit of macro magic in a somewhat generic way: Can we defer the optimizations until there is a use case that cares? The existing use case (fixing up the #GP fault by setting the PASID MSR) isn't performance critical. Let's just have something that is clear (or as clear as any xsave code can be) and correct. -Tony
Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
Hi, Thomas, On Wed, Sep 29, 2021 at 09:51:15AM -0700, Luck, Tony wrote: > > There is zero requirement to look at TIF_NEED_FPU_LOAD or > > fpregs_state_valid() simply because the #GP comes straight from user > > space which means the FPU registers contain the current tasks user space > > state. > > Just to double confirm ... there is no point in the #GP handler up to this > point > where pre-emption can occur? Same question here. The fixup function is called after cond_local_irq_enable(). If an interrupt comes before fixup_pasid_exception(), the interrupt may use FPU and call kernel_fpu_begin_mask()->set(TIF_NEED_FPU_LOAD)-> __cpu_invalidate_fpregs_state(). Then writing to the IA32_PASID MSR. When exiting to user, the FPU states will be restored to the FPU regs including the IA32_PASID MSR. So the MSR could be different from the value written in fixup_pasid_exception(). Is it possible? Or should fixup_pasid_exception() be called before cond_local_irq_enable()? Thanks. -Fenghua
Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote: > On 9/29/21 05:28, Thomas Gleixner wrote: >> Looking at that patch again, none of this muck in fpu__pasid_write() is >> required at all. The whole exception fixup is: >> >> if (!user_mode(regs)) >> return false; >> >> if (!current->mm->pasid) >> return false; >> >> if (current->pasid_activated) >> return false; > > <-- preemption or BH here: kaboom. Sigh, this had obviously to run in the early portion of #GP, i.e. before enabling interrupts. For me that's more than obvious and I apologize that I failed to mention it. >> >> wrmsrl(MSR_IA32_PASID, current->mm->pasid); > > This needs the actual sane fpstate writing helper -- see other email. And with that there is no fpstate write helper required at all. Stop this overengineering madness already. Thanks, tglx ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
On Wed, Sep 29, 2021 at 07:15:53PM +0200, Thomas Gleixner wrote: > On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote: > > On 9/29/21 05:28, Thomas Gleixner wrote: > >> Looking at that patch again, none of this muck in fpu__pasid_write() is > >> required at all. The whole exception fixup is: > >> > >> if (!user_mode(regs)) > >> return false; > >> > >> if (!current->mm->pasid) > >> return false; > >> > >> if (current->pasid_activated) > >> return false; > > > > <-- preemption or BH here: kaboom. > > Sigh, this had obviously to run in the early portion of #GP, i.e. before > enabling interrupts. Like this? Obviously with some comment about why this is being done. diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index a58800973aed..a848a59291e7 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -536,6 +536,12 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection) unsigned long gp_addr; int ret; + if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) { + current->pasid_activated = 1; + wrmsrl(MSR_IA32_PASID, current->mm->pasid | MSR_IA32_PASID_VALID); + return; + } + cond_local_irq_enable(regs); if (static_cpu_has(X86_FEATURE_UMIP)) { -Tony ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
On Wed, Sep 29, 2021, at 10:41 AM, Luck, Tony wrote: > On Wed, Sep 29, 2021 at 07:15:53PM +0200, Thomas Gleixner wrote: >> On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote: >> > On 9/29/21 05:28, Thomas Gleixner wrote: >> >> Looking at that patch again, none of this muck in fpu__pasid_write() is >> >> required at all. The whole exception fixup is: >> >> >> >> if (!user_mode(regs)) >> >> return false; >> >> >> >> if (!current->mm->pasid) >> >> return false; >> >> >> >> if (current->pasid_activated) >> >>return false; >> > >> > <-- preemption or BH here: kaboom. >> >> Sigh, this had obviously to run in the early portion of #GP, i.e. before >> enabling interrupts. > > Like this? Obviously with some comment about why this is being done. > > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c > index a58800973aed..a848a59291e7 100644 > --- a/arch/x86/kernel/traps.c > +++ b/arch/x86/kernel/traps.c > @@ -536,6 +536,12 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection) > unsigned long gp_addr; > int ret; > > + if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) > { > + current->pasid_activated = 1; > + wrmsrl(MSR_IA32_PASID, current->mm->pasid | > MSR_IA32_PASID_VALID); > + return; > + } > + This could do with a WARN_ON_ONCE(TIF_NEED_LOAD_FPU) imo. Is instrumentation allowed to touch FPU state? > cond_local_irq_enable(regs); > > if (static_cpu_has(X86_FEATURE_UMIP)) { > > -Tony ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
On Wed, Sep 29, 2021, at 10:07 AM, Luck, Tony wrote: > On Wed, Sep 29, 2021 at 09:58:22AM -0700, Andy Lutomirski wrote: >> On 9/28/21 16:10, Luck, Tony wrote: >> > Moving beyond pseudo-code and into compiles-but-probably-broken-code. >> > >> > >> > The intent of the functions below is that Fenghua should be able to >> > do: >> > >> > void fpu__pasid_write(u32 pasid) >> > { >> >u64 msr_val = pasid | MSR_IA32_PASID_VALID; >> >struct ia32_pasid_state *addr; >> > >> >addr = begin_update_one_xsave_feature(current, XFEATURE_PASID, true); >> >addr->pasid = msr_val; >> >finish_update_one_xsave_feature(current); >> > } >> > >> >> This gets gnarly because we would presumably like to optimize the case where >> we can do the update directly in registers. I wonder if we can do it with a >> bit of macro magic in a somewhat generic way: > > Can we defere the optimizations until there is a use case that > cares? The existing use case (fixing up the #GP fault by setting > the PASID MSR) isn't performance critical. > > Let's just have something that is clear (or as clear as any xsave > code can be) and correct. > > The goal would be to use the same code for CET and PKRU, I think. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
Hi, Tony, On Wed, Sep 29, 2021 at 10:41:42AM -0700, Luck, Tony wrote: > On Wed, Sep 29, 2021 at 07:15:53PM +0200, Thomas Gleixner wrote: > > On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote: > > > On 9/29/21 05:28, Thomas Gleixner wrote: > > >> Looking at that patch again, none of this muck in fpu__pasid_write() is > > >> required at all. The whole exception fixup is: > > >> > > >> if (!user_mode(regs)) > > >> return false; > > >> > > >> if (!current->mm->pasid) > > >> return false; > > >> > > >> if (current->pasid_activated) > > >> return false; > > > > > > <-- preemption or BH here: kaboom. > > > > Sigh, this had obviously to run in the early portion of #GP, i.e. before > > enabling interrupts. > > Like this? Obviously with some comment about why this is being done. > > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c > index a58800973aed..a848a59291e7 100644 > --- a/arch/x86/kernel/traps.c > +++ b/arch/x86/kernel/traps.c > @@ -536,6 +536,12 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection) > unsigned long gp_addr; > int ret; > Add +#ifdef CONFIG_IOMMU_SUPPORT because mm->pasid and current->pasid_activated are defined only if CONFIG_IOMMU_SUPPORT is defined. > + if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) > { Maybe need to add "&& cpu_feature_enabled(X86_FEATURE_ENQCMD)" because the IA32_PASID MSR is only used when ENQCMD is enabled? > + current->pasid_activated = 1; > + wrmsrl(MSR_IA32_PASID, current->mm->pasid | > MSR_IA32_PASID_VALID); > + return; > + } > + +#endif > cond_local_irq_enable(regs); > > if (static_cpu_has(X86_FEATURE_UMIP)) { Thanks. -Fenghua
Re: [PATCH] [RFC] qcom_scm: hide Kconfig symbol
On Wed, Sep 29, 2021 at 4:46 PM Bjorn Andersson wrote: > > On Wed 29 Sep 05:04 CDT 2021, Arnd Bergmann wrote: > > > On Wed, Sep 29, 2021 at 11:51 AM Will Deacon wrote: > > > On Mon, Sep 27, 2021 at 05:22:13PM +0200, Arnd Bergmann wrote: > > > > > > > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > > > > index 124c41adeca1..989c83acbfee 100644 > > > > --- a/drivers/iommu/Kconfig > > > > +++ b/drivers/iommu/Kconfig > > > > @@ -308,7 +308,7 @@ config APPLE_DART > > > > config ARM_SMMU > > > > tristate "ARM Ltd. System MMU (SMMU) Support" > > > > depends on ARM64 || ARM || (COMPILE_TEST && !GENERIC_ATOMIC64) > > > > - depends on QCOM_SCM || !QCOM_SCM #if QCOM_SCM=m this can't be =y > > > > + select QCOM_SCM > > > > select IOMMU_API > > > > select IOMMU_IO_PGTABLE_LPAE > > > > select ARM_DMA_USE_IOMMU if ARM > > > > > > I don't want to get in the way of this patch because I'm also tired of the > > > randconfig failures caused by QCOM_SCM. However, ARM_SMMU is applicable to > > > a wide variety of (non-qcom) SoCs and so it seems a shame to require the > > > QCOM_SCM code to be included for all of those when it's not strictly > > > needed > > > at all. > > > > Good point, I agree that needs to be fixed. I think this additional > > change should do the trick: > > > > ARM_SMMU and QCOM_IOMMU are two separate implementations and both uses > QCOM_SCM. So both of them should select QCOM_SCM. Right, I figured that out later as well. > "Unfortunately" the Qualcomm portion of ARM_SMMU is builtin > unconditionally, so going with something like select QCOM_SCM if > ARCH_QCOM would still require the stubs in qcom_scm.h. Yes, sounds good. 
I also noticed that I still need one hack in there if I do this: diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c index 55690af1b25d..36c304a8fc9b 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c @@ -427,6 +427,9 @@ struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu) { const struct device_node *np = smmu->dev->of_node; + if (!IS_ENABLED(CONFIG_QCOM_SCM)) + return ERR_PTR(-ENXIO); + #ifdef CONFIG_ACPI if (np == NULL) { /* Match platform for ACPI boot */ Otherwise it still breaks with ARM_SMMU=y and QCOM_SCM=m. Splitting out the qualcomm portion of the arm_smmu driver using a separate 'bool' symbol should also work, if you prefer that and can suggest a name and help text for that symbol. It would look like diff --git a/drivers/iommu/arm/arm-smmu/Makefile b/drivers/iommu/arm/arm-smmu/Makefile index e240a7bcf310..b0cc01aa20c9 100644 --- a/drivers/iommu/arm/arm-smmu/Makefile +++ b/drivers/iommu/arm/arm-smmu/Makefile @@ -1,4 +1,5 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o obj-$(CONFIG_ARM_SMMU) += arm_smmu.o -arm_smmu-objs += arm-smmu.o arm-smmu-impl.o arm-smmu-nvidia.o arm-smmu-qcom.o +arm_smmu-objs += arm-smmu.o arm-smmu-impl.o arm-smmu-nvidia.o +arm_smmu-$(CONFIG_ARM_SMMU_QCOM) += arm-smmu-qcom.o diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-impl.c b/drivers/iommu/arm/arm-smmu/arm-smmu-impl.c index 9f465e146799..2c25cce38060 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu-impl.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-impl.c @@ -215,7 +215,8 @@ struct arm_smmu_device *arm_smmu_impl_init(struct arm_smmu_device *smmu) of_device_is_compatible(np, "nvidia,tegra186-smmu")) return nvidia_smmu_impl_init(smmu); - smmu = qcom_smmu_impl_init(smmu); + if (IS_ENABLED(CONFIG_ARM_SMMU_QCOM)) + smmu = qcom_smmu_impl_init(smmu); if (of_device_is_compatible(np, "marvell,ap806-smmu-500")) smmu->impl = 
&mrvl_mmu500_impl; Arnd
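If the split goes ahead, the new bool symbol might look roughly like the fragment below. This is a sketch only: the symbol name, default, and help text are placeholders pending the naming Arnd asks Bjorn to suggest, and the QCOM_SCM interaction still has to be decided.

```kconfig
config ARM_SMMU_QCOM
	bool "Qualcomm implementation details for the ARM SMMU driver"
	depends on ARM_SMMU
	default ARCH_QCOM
	help
	  Build the Qualcomm-specific portion of the ARM SMMU driver
	  (arm-smmu-qcom.c). If disabled, qcom_smmu_impl_init() is
	  compiled out, so non-Qualcomm configurations do not need to
	  pull in QCOM_SCM.
```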
Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
On Wed, Sep 29, 2021 at 06:07:22PM +, Fenghua Yu wrote: > Add > +#ifdef CONFIG_IOMMU_SUPPORT > because mm->pasid and current-pasid_activated are defined only if > CONFIG_IOMMU_SUPPORT is defined. > > > + if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) > > { > > Maybe need to add "&& cpu_feature_enabled(X86_FEATURE_ENQCMD)" because > the IA32_PASID MSR is only used when ENQCMD is enabled? > > > + current->pasid_activated = 1; > > + wrmsrl(MSR_IA32_PASID, current->mm->pasid | > > MSR_IA32_PASID_VALID); > > + return; > > + } > > + > > +endif New version that addresses those issues. Has ugly #ifdef in C code :-( If that's unacceptable, then could create some stub functions, or add a call to __try_fixup_pasid() that's in a file in the iommu code that is only built when CONFIG_IOMMU_SUPPORT is set. But either of those move the details far away from the #GP handler so make extra work for anyone trying to follow along with what is happening here. -Tony --- diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index a58800973aed..5a3c87fd65de 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -528,6 +528,32 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs, #define GPFSTR "general protection fault" +/* + * When a user executes the ENQCMD instruction it will #GP + * fault if the IA32_PASID MSR has not been set up with a + * valid PASID. + * So if the process has been allocated a PASID (mm->pasid) + * AND the IA32_PASID MSR has not been initialized, try to + * fix this #GP by initializing the IA32_PASID MSR. + * If the #GP was for some other reason, it will trigger + * again, but this routine will return false and the #GP + * will be processed. 
+ */ +static bool try_fixup_pasid(void) + { + if (!cpu_feature_enabled(X86_FEATURE_ENQCMD)) + return false; + +#ifdef CONFIG_IOMMU_SUPPORT + if (current->mm->pasid && !current->pasid_activated) { + current->pasid_activated = 1; + wrmsrl(MSR_IA32_PASID, current->mm->pasid); + return true; + } +#endif + return false; +} + DEFINE_IDTENTRY_ERRORCODE(exc_general_protection) { char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR; @@ -536,6 +562,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection) unsigned long gp_addr; int ret; + if (user_mode(regs) && try_fixup_pasid()) + return; + cond_local_irq_enable(regs); if (static_cpu_has(X86_FEATURE_UMIP)) {
Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
On Wed, 29 Sep 2021 12:08:59 +1000 David Gibson wrote: > On Sun, Sep 19, 2021 at 02:38:30PM +0800, Liu Yi L wrote: > > This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for > > userspace to directly open a vfio device w/o relying on container/group > > (/dev/vfio/$GROUP). Anything related to group is now hidden behind > > iommufd (more specifically in iommu core by this RFC) in a device-centric > > manner. > > > > In case a device is exposed in both legacy and new interfaces (see next > > patch for how to decide it), this patch also ensures that when the device > > is already opened via one interface then the other one must be blocked. > > > > Signed-off-by: Liu Yi L > [snip] > > > +static bool vfio_device_in_container(struct vfio_device *device) > > +{ > > + return !!(device->group && device->group->container); > > You don't need !! here. && is already a logical operation, so returns > a valid bool. > > > +} > > + > > static int vfio_device_fops_release(struct inode *inode, struct file > > *filep) > > { > > struct vfio_device *device = filep->private_data; > > @@ -1560,7 +1691,16 @@ static int vfio_device_fops_release(struct inode > > *inode, struct file *filep) > > > > module_put(device->dev->driver->owner); > > > > - vfio_group_try_dissolve_container(device->group); > > + if (vfio_device_in_container(device)) { > > + vfio_group_try_dissolve_container(device->group); > > + } else { > > + atomic_dec(&device->opened); > > + if (device->group) { > > + mutex_lock(&device->group->opened_lock); > > + device->group->opened--; > > + mutex_unlock(&device->group->opened_lock); > > + } > > + } > > > > vfio_device_put(device); > > > > @@ -1613,6 +1753,7 @@ static int vfio_device_fops_mmap(struct file *filep, > > struct vm_area_struct *vma) > > > > static const struct file_operations vfio_device_fops = { > > .owner = THIS_MODULE, > > + .open = vfio_device_fops_open, > > .release= vfio_device_fops_release, > > .read = vfio_device_fops_read, > > .write = 
vfio_device_fops_write, > > @@ -2295,6 +2436,52 @@ static struct miscdevice vfio_dev = { > > .mode = S_IRUGO | S_IWUGO, > > }; > > > > +static char *vfio_device_devnode(struct device *dev, umode_t *mode) > > +{ > > + return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev)); > > Others have pointed out some problems with the use of dev_name() > here. I'll add that I think you'll make things much easier if instead > of using one huge "devices" subdir, you use a separate subdir for each > vfio sub-driver (so, one for PCI, one for each type of mdev, one for > platform, etc.). That should make avoiding name conflicts a lot simpler. It seems like this is unnecessary if we use the vfioX naming approach. Conflicts are trivial to ignore if we don't involve dev_name() and looking for the correct major:minor chardev in the correct subdirectory seems like a hassle for userspace. Thanks, Alex ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA
Hi, Just to follow up on what we discussed during the LPC VFIO/IOMMU/PCI MC. https://linuxplumbersconf.org/event/11/contributions/1021/ The key takeaways are: 1. Addressing mode selection (PA, IOVA, and KVA) should be a policy decision *not* to be made by device drivers. This implies that it is up to the platform code or the user (via some sysfs knobs) to decide what is best for each device. Drivers should not be aware of which addressing mode is returned by the DMA API. 2. DMA APIs can be extended to support DMA requests with PASID. 3. The performance benefit of using KVA (shared) should be demonstrated, though the saving of IOTLB flushes over IOVA is conceivable. #1 could be done in platform IOMMU code when devices are attached to their default domains. E.g. if the device is trusted, it can operate in shared KVA mode. For #2, it seems we can store the kernel PASID in struct device. This will preserve the DMA API interface while making it PASID capable. Essentially, each PASID-capable device would have two special global PASIDs: - PASID 0 for DMA requests w/o PASID, aka RID2PASID - PASID 1 (randomly selected) for in-kernel DMA requests w/ PASID Both PASID 0 and 1 will always point to the same page table, i.e. the same addressing mode, IOVA or KVA. For devices that do not support PASID, there is no change. For devices that can do DMA both w/ and w/o PASID, the IOTLB invalidation would include both PASIDs. By embedding the PASID in struct device, this also avoids changes in upper-level APIs. The DMA engine API can continue to give out channels without knowing whether PASID is used or not. The accelerator drivers that do work submission can retrieve the PASID from struct device. Thoughts? Thanks for the review and feedback at LPC! Jacob On Tue, 21 Sep 2021 13:29:34 -0700, Jacob Pan wrote: > Hi Joerg/Jason/Christoph et all, > > The current in-kernel supervisor PASID support is based on the SVM/SVA > machinery in sva-lib.
Kernel SVA is achieved by extending a special flag > to indicate the binding of the device and a page table should be performed > on init_mm instead of the mm of the current process.Page requests and > other differences between user and kernel SVA are handled as special > cases. > > This unrestricted binding with the kernel page table is being challenged > for security and the convention that in-kernel DMA must be compatible with > DMA APIs. > (https://lore.kernel.org/linux-iommu/20210511194726.gp1002...@nvidia.com/) > There is also the lack of IOTLB synchronization upon kernel page table > updates. > > This patchset is trying to address these concerns by having an explicit > DMA API compatible model while continue to support in-kernel use of DMA > requests with PASID. Specifically, the following DMA-IOMMU APIs are > introduced: > > int iommu_dma_pasid_enable/disable(struct device *dev, > struct iommu_domain **domain, > enum iommu_dma_pasid_mode mode); > int iommu_map/unmap_kva(struct iommu_domain *domain, > void *cpu_addr,size_t size, int prot); > > The following three addressing modes are supported with example API usages > by device drivers. > > 1. Physical address (bypass) mode. Similar to DMA direct where trusted > devices can DMA pass through IOMMU on a per PASID basis. > Example: > pasid = iommu_dma_pasid_enable(dev, NULL, IOMMU_DMA_PASID_BYPASS); > /* Use the returning PASID and PA for work submission */ > > 2. IOVA mode. DMA API compatible. Map a supervisor PASID the same way as > the PCI requester ID (RID) > Example: > pasid = iommu_dma_pasid_enable(dev, NULL, IOMMU_DMA_PASID_IOVA); > /* Use the PASID and DMA API allocated IOVA for work submission */ > > 3. KVA mode. New kva map/unmap APIs. Support fast and strict sub-modes > transparently based on device trustfulness. 
> Example: > pasid = iommu_dma_pasid_enable(dev, &domain, IOMMU_DMA_PASID_KVA); > iommu_map_kva(domain, &buf, size, prot); > /* Use the returned PASID and KVA to submit work */ > Where: > Fast mode: Shared CPU page tables for trusted devices only > Strict mode: IOMMU domain returned for the untrusted device to > replicate KVA-PA mapping in IOMMU page tables. > > On a per device basis, DMA address and performance modes are enabled by > the device drivers. Platform information such as trustability, user > command line input (not included in this set) could also be taken into > consideration (not implemented in this RFC). > > This RFC is intended to communicate the API directions. Little testing is > done outside IDXD and DMA engine tests. > > For PA and IOVA modes, the implementation is straightforward and tested > with Intel IDXD driver. But several opens remain in KVA fast mode thus > not tested: 1. Lack of IOTLB synchronization, kernel direct map alias can > be updated as a result of module loading/eBPF load. Adding kernel mmu > notifier? 2. The use of the auxiliary d
Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA
On Wed, Sep 29, 2021 at 12:37:19PM -0700, Jacob Pan wrote: > For #2, it seems we can store the kernel PASID in struct device. This will > preserve the DMA API interface while making it PASID capable. Essentially, > each PASID capable device would have two special global PASIDs: > - PASID 0 for DMA request w/o PASID, aka RID2PASID > - PASID 1 (randomly selected) for in-kernel DMA request w/ PASID This seems reasonable, I had the same thought. Basically just have the driver issue some trivial call: pci_enable_pasid_dma(pdev, &pasid) And then DMA tagged with the PASID will be handled equivalently to untagged DMA. Basically PASID and no-PASID point to the exact same IO page table and the DMA API manipulates that single page table. Having multiple RIDs pointing at the same IO page table is something we expect iommufd to require, so the whole thing should ideally fall out naturally. Jason
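A toy userspace model of the invariant being proposed: the kernel PASID is just an alias for the RID's IO page table. fake_pci_enable_pasid_dma() is a guess at the shape of the suggested helper, not an existing kernel API, and the structs are stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: "PASID 1" aliases the RID's (PASID 0's) IO page table. */
struct io_pgtable { int dummy; };

struct fake_pci_dev {
	struct io_pgtable *rid_pgtable;		/* table for DMA w/o PASID */
	struct io_pgtable *pasid_pgtable;	/* table for DMA w/ PASID */
	unsigned int kernel_pasid;		/* 0 = not enabled yet */
};

/* Hypothetical shape of Jason's pci_enable_pasid_dma(pdev, &pasid). */
static int fake_pci_enable_pasid_dma(struct fake_pci_dev *pdev,
				     unsigned int *pasid)
{
	pdev->kernel_pasid = 1;		/* "PASID 1" in Jacob's scheme */
	/* the key invariant: tagged DMA hits the exact same page table */
	pdev->pasid_pgtable = pdev->rid_pgtable;
	*pasid = pdev->kernel_pasid;
	return 0;
}
```

The point of the model is only that the DMA API keeps manipulating a single page table; whether the request carries a PASID tag changes nothing about the mapping.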
Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
On Wed, Sep 29 2021 at 11:31, Tony Luck wrote: > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c > index a58800973aed..5a3c87fd65de 100644 > --- a/arch/x86/kernel/traps.c > +++ b/arch/x86/kernel/traps.c > @@ -528,6 +528,32 @@ static enum kernel_gp_hint get_kernel_gp_address(struct > pt_regs *regs, > > #define GPFSTR "general protection fault" > > +/* > + * When a user executes the ENQCMD instruction it will #GP > + * fault if the IA32_PASID MSR has not been set up with a > + * valid PASID. > + * So if the process has been allocated a PASID (mm->pasid) > + * AND the IA32_PASID MSR has not been initialized, try to > + * fix this #GP by initializing the IA32_PASID MSR. > + * If the #GP was for some other reason, it will trigger > + * again, but this routine will return false and the #GP > + * will be processed. > + */ > +static void try_fixup_pasid(void) > +{ > + if (!cpu_feature_enabled(X86_FEATURE_ENQCMD)) > + return false; > + > +#ifdef CONFIG_IOMMU_SUPPORT > + if (current->mm->pasid && !current->pasid_activated) { > + current->pasid_activated = 1; > + wrmsrl(MSR_IA32_PASID, current->mm->pasid); > + return true; > + } > +#endif > + return false; > +} > + > DEFINE_IDTENTRY_ERRORCODE(exc_general_protection) > { > char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR; > @@ -536,6 +562,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection) > unsigned long gp_addr; > int ret; > > + if (user_mode(regs) && try_fixup_pasid()) > + return; > + > cond_local_irq_enable(regs); > > if (static_cpu_has(X86_FEATURE_UMIP)) { Amen. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
On 2021-09-28 12:32 p.m., Jason Gunthorpe wrote: > On Thu, Sep 16, 2021 at 05:40:41PM -0600, Logan Gunthorpe wrote: >> config PCI_P2PDMA >> bool "PCI peer-to-peer transfer support" >> -depends on ZONE_DEVICE >> +depends on ZONE_DEVICE && 64BIT > > Perhaps a comment to explain what the 64bit is doing? Added. >> select GENERIC_ALLOCATOR >> help >>Enableѕ drivers to do PCI peer-to-peer transactions to and from >> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h >> index 266754a55327..e62b1cf6386f 100644 >> +++ b/include/linux/scatterlist.h >> @@ -64,6 +64,21 @@ struct sg_append_table { >> #define SG_CHAIN0x01UL >> #define SG_END 0x02UL >> >> +/* >> + * bit 2 is the third free bit in the page_link on 64bit systems which >> + * is used by dma_unmap_sg() to determine if the dma_address is a PCI >> + * bus address when doing P2PDMA. >> + * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this. >> + */ >> + >> +#ifdef CONFIG_PCI_P2PDMA >> +#define SG_DMA_PCI_P2PDMA 0x04UL > > Add a > static_assert(__alignof__(void *) == 8); > > ? Good idea. Though, I think your line isn't quite correct. I've added: static_assert(__alignof__(struct page) >= 8); >> +#define sg_is_dma_pci_p2pdma(sg) ((sg)->page_link & SG_DMA_PCI_P2PDMA) > > I've been encouraging people to use static inlines more.. I also prefer static inlines, but I usually follow the style of the code I'm changing. In any case, I've changed to static inlines similar to your example. >> /** >> * sg_assign_page - Assign a given page to an SG entry >> @@ -86,13 +103,13 @@ struct sg_append_table { >> **/ >> static inline void sg_assign_page(struct scatterlist *sg, struct page *page) >> { >> -unsigned long page_link = sg->page_link & (SG_CHAIN | SG_END); >> +unsigned long page_link = sg->page_link & SG_PAGE_LINK_MASK; > > I think this should just be '& SG_END', sg_assign_page() doesn't look > like it should ever be used on a sg_chain entry, so this is just > trying to preserve the end stamp. 
Perhaps, but I'm not comfortable making that change in this patch or series. Though, I've reverted this specific change in my patch so sg_assign_page() will clear SG_DMA_PCI_P2PDMA. Logan ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
On 2021-09-28 12:55 p.m., Jason Gunthorpe wrote: > On Thu, Sep 16, 2021 at 05:40:44PM -0600, Logan Gunthorpe wrote: >> Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg() >> implementations. It takes an scatterlist segment that must point to a >> pci_p2pdma struct page and will map it if the mapping requires a bus >> address. >> >> The return value indicates whether the mapping required a bus address >> or whether the caller still needs to map the segment normally. If the >> segment should not be mapped, -EREMOTEIO is returned. >> >> This helper uses a state structure to track the changes to the >> pgmap across calls and avoid needing to lookup into the xarray for >> every page. >> >> Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU >> dma_map_sg() implementations where the sg segment containing the page >> differs from the sg segment containing the DMA address. >> >> Signed-off-by: Logan Gunthorpe >> drivers/pci/p2pdma.c | 59 ++ >> include/linux/pci-p2pdma.h | 21 ++ >> 2 files changed, 80 insertions(+) >> >> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c >> index b656d8c801a7..58c34f1f1473 100644 >> +++ b/drivers/pci/p2pdma.c >> @@ -943,6 +943,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, >> struct scatterlist *sg, >> } >> EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs); >> >> +/** >> + * pci_p2pdma_map_segment - map an sg segment determining the mapping type >> + * @state: State structure that should be declared outside of the >> for_each_sg() >> + * loop and initialized to zero. >> + * @dev: DMA device that's doing the mapping operation >> + * @sg: scatterlist segment to map >> + * >> + * This is a helper to be used by non-iommu dma_map_sg() implementations >> where >> + * the sg segment is the same for the page_link and the dma_address. >> + * >> + * Attempt to map a single segment in an SGL with the PCI bus address. 
>> + * The segment must point to a PCI P2PDMA page and thus must be >> + * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check. >> + * >> + * Returns the type of mapping used and maps the page if the type is >> + * PCI_P2PDMA_MAP_BUS_ADDR. >> + */ >> +enum pci_p2pdma_map_type >> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device >> *dev, >> + struct scatterlist *sg) >> +{ >> +if (state->pgmap != sg_page(sg)->pgmap) { >> +state->pgmap = sg_page(sg)->pgmap; >> +state->map = pci_p2pdma_map_type(state->pgmap, dev); >> +state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset; >> +} > > Is this safe? I was just talking with Joao about this, > > https://lore.kernel.org/r/20210928180150.gi3544...@ziepe.ca > I agree that taking the extra reference on the pagemap seems unnecessary. However, it was convenient for the purposes of this patchset to not have to check the page type for every page and only on every new page map. But if we need to add a check directly to gup, that'd probably be fine too. > API wise I absolutely think this should be safe as written, but is it > really? > > Does pgmap ensure that a positive refcount struct page always has a > valid pgmap pointer (and thus the mess in gup can be deleted) or does > this need to get the pgmap as well to keep it alive? Yes, the P2PDMA code ensures that the pgmap will not be deleted if there is a positive refcount on any struct page. Logan ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 4/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
On 2021-09-28 4:05 p.m., Jason Gunthorpe wrote: > On Thu, Sep 16, 2021 at 05:40:44PM -0600, Logan Gunthorpe wrote: > >> +enum pci_p2pdma_map_type >> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device >> *dev, >> + struct scatterlist *sg) >> +{ >> +if (state->pgmap != sg_page(sg)->pgmap) { >> +state->pgmap = sg_page(sg)->pgmap; > > This has built into it an assumption that every page in the sg element > has the same pgmap, but AFAIK nothing enforces this rule? There is no > requirement that the HW has pfn gaps between the pgmaps linux decides > to create over it. No, that's not a correct reading of the code. Every time there is a new pagemap, this code calculates the mapping type and bus offset. If a page comes along with a different page map, it recalculates. This just reduces the overhead so that the calculation is done only every time a page with a different pgmap comes along and not doing it for every single page. > At least sg_alloc_append_table_from_pages() and probably something in > the block world should be updated to not combine struct pages with > different pgmaps, and this should be documented in scatterlist.* > someplace. There's no sane place to do this check. The code is designed to support mappings with different pgmaps. Logan ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 14/20] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
On 2021-09-28 1:47 p.m., Jason Gunthorpe wrote: > On Thu, Sep 16, 2021 at 05:40:54PM -0600, Logan Gunthorpe wrote: >> Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to >> allow obtaining P2PDMA pages. If a caller does not set this flag >> and tries to map P2PDMA pages it will fail. >> >> This is implemented by adding a flag and a check to get_dev_pagemap(). > > I would like to see the get_dev_pagemap() deleted from GUP in the > first place. > > Why isn't this just a simple check of the page->pgmap type after > acquiring a valid page reference? See my prior note. It could be, but that will mean dereferencing the pgmap for every page to determine the type of page and then comparing with FOLL_PCI_P2PDMA. Probably not terrible to go this way. Logan ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On 2021-09-28 1:55 p.m., Jason Gunthorpe wrote: > On Thu, Sep 16, 2021 at 05:40:59PM -0600, Logan Gunthorpe wrote: >> +int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma) >> +{ >> +struct pci_p2pdma_map *pmap; >> +struct pci_p2pdma *p2pdma; >> +int ret; >> + >> +/* prevent private mappings from being established */ >> +if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) { >> +pci_info_ratelimited(pdev, >> + "%s: fail, attempted private mapping\n", >> + current->comm); >> +return -EINVAL; >> +} >> + >> +pmap = pci_p2pdma_map_alloc(pdev, vma->vm_end - vma->vm_start); >> +if (!pmap) >> +return -ENOMEM; >> + >> +rcu_read_lock(); >> +p2pdma = rcu_dereference(pdev->p2pdma); >> +if (!p2pdma) { >> +ret = -ENODEV; >> +goto out; >> +} >> + >> +ret = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt, >> +&pci_p2pdma_fs_cnt); >> +if (ret) >> +goto out; >> + >> +ihold(p2pdma->inode); >> +pmap->inode = p2pdma->inode; >> +rcu_read_unlock(); >> + >> +vma->vm_flags |= VM_MIXEDMAP; > > Why is this a VM_MIXEDMAP? Everything fault sticks in here has a > struct page, right? Yes. This decision is not so simple, I tried a few variations before settling on this. The main reason is probably this: if we don't use VM_MIXEDMAP, then we can't set pte_devmap(). If we don't set pte_devmap(), then every single page that GUP processes needs to check if it's a ZONE_DEVICE page and also if it's a P2PDMA page (thus dereferencing pgmap) in order to satisfy the requirements of FOLL_PCI_P2PDMA. I didn't think other developers would go for that kind of performance hit. With VM_MIXEDMAP we hide the performance penalty behind the existing check. And with the current pgmap code as is, we only need to do that check once for every new pgmap pointer. Logan ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On 2021-09-28 2:05 p.m., Jason Gunthorpe wrote: > On Thu, Sep 16, 2021 at 05:40:59PM -0600, Logan Gunthorpe wrote: > >> +static void pci_p2pdma_unmap_mappings(void *data) >> +{ >> +struct pci_dev *pdev = data; >> +struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1); >> + >> +p2pdma->active = false; >> +synchronize_rcu(); >> +unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1); >> +pci_p2pdma_free_mappings(p2pdma->inode->i_mapping); >> +} > > If this is going to rely on unmap_mapping_range then GUP should also > reject this memory for FOLL_LONGTERM.. Right, makes sense. > > What along this control flow: > >> + error = devm_add_action_or_reset(&pdev->dev, >> pci_p2pdma_unmap_mappings, >> +pdev); > > Waits for all the page refcounts to go to zero? That's already in the existing code as part of memunmap_pages() which puts the original reference to all the pages and then waits for the reference to go to zero. This new action unmaps all the VMAs so that the subsequent call to memunmap_pages() doesn't block on userspace processes. Logan ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices
On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote: > On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote: >> Hi, >> >> This patchset continues my work to add userspace P2PDMA access using >> O_DIRECT NVMe devices. My last posting[1] just included the first 13 >> patches in this series, but the early P2PDMA cleanup and map_sg error >> changes from that series have been merged into v5.15-rc1. To address >> concerns that that series did not add any new functionality, I've added >> back the userspace functionality from the original RFC[2] (but improved >> based on the original feedback). > > I really think this is the best series yet, it really looks nice > overall. I know the sg flag was a bit of a debate at the start, but it > serves an undeniable purpose and the resulting standard DMA APIs 'just > working' is really clean. Actually, so far, nobody has said anything negative about using the SG flag. > There is more possible here, we could also pass the new GUP flag in the > ib_umem code.. Yes, that would be very useful. > After this gets merged I would make a series to split out the CMB > genalloc related stuff and try and probably get something like VFIO to > export this kind of memory as well, then it would have pretty nice > coverage. Yup! Thanks for the review. Anything I didn't respond to I've either made changes for, or am still working on and will be addressed for subsequent postings. Logan ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
> From: Jason Gunthorpe > Sent: Wednesday, September 29, 2021 8:28 PM > > On Wed, Sep 29, 2021 at 06:41:00AM +, Tian, Kevin wrote: > > > From: David Gibson > > > Sent: Wednesday, September 29, 2021 2:01 PM > > > > > > On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote: > > > > This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind > the > > > vfio > > > > device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface > is > > > provided > > > > because it's implicitly done when the device fd is closed. > > > > > > > > In concept a vfio device can be bound to multiple iommufds, each > hosting > > > > a subset of I/O address spaces attached by this device. > > > > > > I really feel like this many<->many mapping between devices is going > > > to be super-confusing, and therefore make it really hard to be > > > confident we have all the rules right for proper isolation. > > > > Based on new discussion on group ownership part (patch06), I feel this > > many<->many relationship will disappear. The context fd (either container > > or iommufd) will uniquely mark the ownership on a physical device and > > its group. With this design it's impractical to have one device bound > > to multiple iommufds. > > That should be a requirement! We have no way to prove that two > iommufds are the same security domain, so devices/groups cannot be > shared. > > That is why the API I suggested takes in a struct file to ID the user > security context. A group is accessible only from that single struct > file and no more. > > If the first series goes the way I outlined then I think David's > concern about security is strongly solved as the IOMMU layer is > directly managing it with a very clear responsibility and semantic. > Yes, this is also my understanding now. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 4/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
On Wed, Sep 29, 2021 at 03:30:42PM -0600, Logan Gunthorpe wrote: > > > > On 2021-09-28 4:05 p.m., Jason Gunthorpe wrote: > > On Thu, Sep 16, 2021 at 05:40:44PM -0600, Logan Gunthorpe wrote: > > > >> +enum pci_p2pdma_map_type > >> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device > >> *dev, > >> + struct scatterlist *sg) > >> +{ > >> + if (state->pgmap != sg_page(sg)->pgmap) { > >> + state->pgmap = sg_page(sg)->pgmap; > > > > This has built into it an assumption that every page in the sg element > > has the same pgmap, but AFAIK nothing enforces this rule? There is no > > requirement that the HW has pfn gaps between the pgmaps linux decides > > to create over it. > > No, that's not a correct reading of the code. Every time there is a new > pagemap, this code calculates the mapping type and bus offset. If a page > comes along with a different page map,f it recalculates. This just > reduces the overhead so that the calculation is done only every time a > page with a different pgmap comes along and not doing it for every > single page. Each 'struct scatterlist *sg' refers to a range of contiguous pfns starting at page_to_pfn(sg_page()) and going for approx sg->length/PAGE_SIZE pfns long. sg_page() returns the first page, but nothing says that sg_page()+1 has the same pgmap. The code in this patch does check the first page of each sg in a larger sgl. > > At least sg_alloc_append_table_from_pages() and probably something in > > the block world should be updated to not combine struct pages with > > different pgmaps, and this should be documented in scatterlist.* > > someplace. > > There's no sane place to do this check. The code is designed to support > mappings with different pgmaps. All places that generate compound sg's by aggregating multiple pages need to include this check along side the check for physical contiguity. 
There are not that many places but sg_alloc_append_table_from_pages() is one of them: @@ -470,7 +470,8 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append, /* Merge contiguous pages into the last SG */ prv_len = sgt_append->prv->length; - while (n_pages && page_to_pfn(pages[0]) == paddr) { + while (n_pages && page_to_pfn(pages[0]) == paddr && + sg_page(sgt_append->prv)->pgmap == pages[0]->pgmap) { if (sgt_append->prv->length + PAGE_SIZE > max_segment) break; sgt_append->prv->length += PAGE_SIZE; @@ -488,7 +489,8 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append, for (i = 1; i < n_pages; i++) { seg_len += PAGE_SIZE; if (seg_len >= max_segment || - page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1) { + page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1 || + pages[i]->pgmap != pages[i - 1]->pgmap) { chunks++; seg_len = 0; } @@ -505,9 +507,10 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append, seg_len += PAGE_SIZE; if (seg_len >= max_segment || page_to_pfn(pages[j]) != - page_to_pfn(pages[j - 1]) + 1) + page_to_pfn(pages[j - 1]) + 1 || + pages[j]->pgmap != pages[j - 1]->pgmap) { break; - } + } /* Pass how many chunks might be left */ s = get_next_sg(sgt_append, s, chunks - i + left_pages, ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 14/20] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
On Wed, Sep 29, 2021 at 03:34:22PM -0600, Logan Gunthorpe wrote: > > > > On 2021-09-28 1:47 p.m., Jason Gunthorpe wrote: > > On Thu, Sep 16, 2021 at 05:40:54PM -0600, Logan Gunthorpe wrote: > >> Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to > >> allow obtaining P2PDMA pages. If a caller does not set this flag > >> and tries to map P2PDMA pages it will fail. > >> > >> This is implemented by adding a flag and a check to get_dev_pagemap(). > > > > I would like to see the get_dev_pagemap() deleted from GUP in the > > first place. > > > > Why isn't this just a simple check of the page->pgmap type after > > acquiring a valid page reference? See my prior note > > It could be, but that will mean dereferencing the pgmap for every page > to determine the type of page and then comparing with FOLL_PCI_P2PDMA. It would be done under the pte devmap test and this is less expensive than the xarray search. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA
Hi Jason, On Wed, 29 Sep 2021 16:39:53 -0300, Jason Gunthorpe wrote: > On Wed, Sep 29, 2021 at 12:37:19PM -0700, Jacob Pan wrote: > > > For #2, it seems we can store the kernel PASID in struct device. This > > will preserve the DMA API interface while making it PASID capable. > > Essentially, each PASID capable device would have two special global > > PASIDs: > > - PASID 0 for DMA request w/o PASID, aka RID2PASID > > - PASID 1 (randomly selected) for in-kernel DMA request w/ > > PASID > > This seems reasonable, I had the same thought. Basically just have the > driver issue some trivial call: > pci_enable_pasid_dma(pdev, &pasid) That would work, but I guess it needs to be an iommu_ call instead of pci_? Or, it can be done by the platform IOMMU code where system PASID is automatically enabled for PASID capable devices during boot and stored in struct device. Device drivers can retrieve the PASID from struct device. I think your suggestion is more precise, in case the driver does not want to do DMA w/ PASID, we can do less IOTLB flush (PASID 0 only). > And then DMA tagged with the PASID will be handled equivalent to > untagged DMA. Basically PASID and no PASID point to the exact same IO > page table and the DMA API manipulates that single page table. > > Having multiple RID's pointing at the same IO page table is something > we expect iommufd to require so the whole thing should ideally fall > out naturally. That would be the equivalent of attaching multiple devices to the same IOMMU domain, right? > Jason Thanks, Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 12/20] RDMA/rw: use dma_map_sgtable()
On 2021-09-28 1:43 p.m., Jason Gunthorpe wrote: > On Thu, Sep 16, 2021 at 05:40:52PM -0600, Logan Gunthorpe wrote: >> dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg() >> is no longer necessary and may be dropped. >> >> Switch to the dma_map_sgtable() interface which will allow for better >> error reporting if the P2PDMA pages are unsupported. >> >> The change to sgtable also appears to fix a couple subtle error path >> bugs: >> >> - In rdma_rw_ctx_init(), dma_unmap would be called with an sg >> that could have been incremented from the original call, as >> well as an nents that was not the original number of nents >> called when mapped. >> - Similarly in rdma_rw_ctx_signature_init, both sg and prot_sg >> were unmapped with the incorrect number of nents. > > Those bugs should definately get fixed.. I might extract the sgtable > conversion into a stand alone patch to do it. Yes. I can try to split it off myself and send a patch later this week. Logan ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 4/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
On 2021-09-29 4:46 p.m., Jason Gunthorpe wrote: > On Wed, Sep 29, 2021 at 03:30:42PM -0600, Logan Gunthorpe wrote: >> On 2021-09-28 4:05 p.m., Jason Gunthorpe wrote: >> No, that's not a correct reading of the code. Every time there is a new >> pagemap, this code calculates the mapping type and bus offset. If a page >> comes along with a different page map, it recalculates. This just >> reduces the overhead so that the calculation is done only every time a >> page with a different pgmap comes along and not doing it for every >> single page. > > Each 'struct scatterlist *sg' refers to a range of contiguous pfns > starting at page_to_pfn(sg_page()) and going for approx sg->length/PAGE_SIZE > pfns long. > Ugh, right. A bit contrived for consecutive pages to have different pgmaps and still be next to each other in a DMA transaction. But I guess it is technically possible and should be protected against. > sg_page() returns the first page, but nothing says that sg_page()+1 > has the same pgmap. > > The code in this patch does check the first page of each sg in a > larger sgl. > >>> At least sg_alloc_append_table_from_pages() and probably something in >>> the block world should be updated to not combine struct pages with >>> different pgmaps, and this should be documented in scatterlist.* >>> someplace. >> >> There's no sane place to do this check. The code is designed to support >> mappings with different pgmaps. > > All places that generate compound sg's by aggregating multiple pages > need to include this check along side the check for physical > contiguity. There are not that many places but > sg_alloc_append_table_from_pages() is one of them: Yes. The block layer also does this. I believe a check in page_is_mergable() will be sufficient there.
> @@ -470,7 +470,8 @@ int sg_alloc_append_table_from_pages(struct > sg_append_table *sgt_append, > > /* Merge contiguous pages into the last SG */ > prv_len = sgt_append->prv->length; > - while (n_pages && page_to_pfn(pages[0]) == paddr) { > + while (n_pages && page_to_pfn(pages[0]) == paddr && > + sg_page(sgt_append->prv)->pgmap == pages[0]->pgmap) { I don't think it's correct to use pgmap without first checking if it is a zone device page. But your point is taken. I'll try to address this. Logan ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On Wed, Sep 29, 2021 at 03:42:00PM -0600, Logan Gunthorpe wrote: > The main reason is probably this: if we don't use VM_MIXEDMAP, then we > can't set pte_devmap(). I think that is an API limitation in the fault routines.. finish_fault() should set the pte_devmap - eg by passing the PFN_DEV|PFN_MAP somehow through the vma->vm_page_prot to mk_pte() or otherwise signaling do_set_pte() that it should set those PTE bits when it creates the entry. (or there should be a vmf_* helper for this special case, but using the vmf->page seems righter to me) > If we don't set pte_devmap(), then every single page that GUP > processes needs to check if it's a ZONE_DEVICE page and also if it's > a P2PDMA page (thus dereferencing pgmap) in order to satisfy the > requirements of FOLL_PCI_P2PDMA. Definately not suggesting not to set pte_devmap(), only that VM_MIXEDMAP should not be set on VMAs that only contain struct pages. That is an abuse of what it is intended for. At the very least there should be a big comment above the usage explaining that this is just working around a limitation in finish_fault() where it cannot set the PFN_DEV|PFN_MAP bits today. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices
On Wed, Sep 29, 2021 at 03:50:02PM -0600, Logan Gunthorpe wrote: > > > On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote: > > On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote: > >> Hi, > >> > >> This patchset continues my work to add userspace P2PDMA access using > >> O_DIRECT NVMe devices. My last posting[1] just included the first 13 > >> patches in this series, but the early P2PDMA cleanup and map_sg error > >> changes from that series have been merged into v5.15-rc1. To address > >> concerns that that series did not add any new functionality, I've added > >> back the userspcae functionality from the original RFC[2] (but improved > >> based on the original feedback). > > > > I really think this is the best series yet, it really looks nice > > overall. I know the sg flag was a bit of a debate at the start, but it > > serves an undeniable purpose and the resulting standard DMA APIs 'just > > working' is really clean. > > Actually, so far, nobody has said anything negative about using the SG flag. > > > There is more possible here, we could also pass the new GUP flag in the > > ib_umem code.. > > Yes, that would be very useful. You might actually prefer to do that then the bio changes to get the infrastructur merged as it seems less "core" Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On 2021-09-29 5:05 p.m., Jason Gunthorpe wrote: > On Wed, Sep 29, 2021 at 03:42:00PM -0600, Logan Gunthorpe wrote: > >> The main reason is probably this: if we don't use VM_MIXEDMAP, then we >> can't set pte_devmap(). > > I think that is an API limitation in the fault routines.. > > finish_fault() should set the pte_devmap - eg by passing the > PFN_DEV|PFN_MAP somehow through the vma->vm_page_prot to mk_pte() or > otherwise signaling do_set_pte() that it should set those PTE bits > when it creates the entry. > > (or there should be a vmf_* helper for this special case, but using > the vmf->page seems righter to me) I'm not opposed to this. Though I'm not sure what's best here. >> If we don't set pte_devmap(), then every single page that GUP >> processes needs to check if it's a ZONE_DEVICE page and also if it's >> a P2PDMA page (thus dereferencing pgmap) in order to satisfy the >> requirements of FOLL_PCI_P2PDMA. > > Definately not suggesting not to set pte_devmap(), only that > VM_MIXEDMAP should not be set on VMAs that only contain struct > pages. That is an abuse of what it is intended for. > > At the very least there should be a big comment above the usage > explaining that this is just working around a limitation in > finish_fault() where it cannot set the PFN_DEV|PFN_MAP bits today. Is it? Documentation on vmf_insert_mixed() and VM_MIXEDMAP is not good and the intention is not clear. I got the impression that mm people wanted those interfaces used for users of pte_devmap(). device-dax uses these interfaces and as far as I can see it also only contains struct pages (or at least dev_dax_huge_fault() calls pfn_to_page() on every page when VM_FAULT_NOPAGE happens). So it would be nice to get some direction here from mm developers on what they'd prefer. Logan ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices
On 2021-09-29 5:21 p.m., Jason Gunthorpe wrote: > On Wed, Sep 29, 2021 at 03:50:02PM -0600, Logan Gunthorpe wrote: >> >> >> On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote: >>> On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote: Hi, This patchset continues my work to add userspace P2PDMA access using O_DIRECT NVMe devices. My last posting[1] just included the first 13 patches in this series, but the early P2PDMA cleanup and map_sg error changes from that series have been merged into v5.15-rc1. To address concerns that that series did not add any new functionality, I've added back the userspcae functionality from the original RFC[2] (but improved based on the original feedback). >>> >>> I really think this is the best series yet, it really looks nice >>> overall. I know the sg flag was a bit of a debate at the start, but it >>> serves an undeniable purpose and the resulting standard DMA APIs 'just >>> working' is really clean. >> >> Actually, so far, nobody has said anything negative about using the SG flag. >> >>> There is more possible here, we could also pass the new GUP flag in the >>> ib_umem code.. >> >> Yes, that would be very useful. > > You might actually prefer to do that then the bio changes to get the > infrastructur merged as it seems less "core" I'm a little bit more concerned about my patch set growing too large. It's already at 20 patches and I think I'll need to add a couple more based on the feedback you've already provided. So I'm leaning toward pushing more functionality as future work. Logan ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On Wed, Sep 29, 2021 at 05:27:22PM -0600, Logan Gunthorpe wrote: > > finish_fault() should set the pte_devmap - eg by passing the > > PFN_DEV|PFN_MAP somehow through the vma->vm_page_prot to mk_pte() or > > otherwise signaling do_set_pte() that it should set those PTE bits > > when it creates the entry. > > > > (or there should be a vmf_* helper for this special case, but using > > the vmf->page seems righter to me) > > I'm not opposed to this. Though I'm not sure what's best here. > > >> If we don't set pte_devmap(), then every single page that GUP > >> processes needs to check if it's a ZONE_DEVICE page and also if it's > >> a P2PDMA page (thus dereferencing pgmap) in order to satisfy the > >> requirements of FOLL_PCI_P2PDMA. > > > > Definately not suggesting not to set pte_devmap(), only that > > VM_MIXEDMAP should not be set on VMAs that only contain struct > > pages. That is an abuse of what it is intended for. > > > > At the very least there should be a big comment above the usage > > explaining that this is just working around a limitation in > > finish_fault() where it cannot set the PFN_DEV|PFN_MAP bits today. > > Is it? Documentation on vmf_insert_mixed() and VM_MIXEDMAP is not good > and the intention is not clear. I got the impression that mm people > wanted those interfaces used for users of pte_devmap(). I thought VM_MIXEDMAP was quite clear: #define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */ This VMA does not include PFN pages, so it should not be tagged VM_MIXEDMAP. Aside from enabling the special vmf_ API, it only controls some special behavior in vm_normal_page: * VM_MIXEDMAP mappings can likewise contain memory with or without "struct * page" backing, however the difference is that _all_ pages with a struct * page (that is, those where pfn_valid is true) are refcounted and considered * normal pages by the VM.
The disadvantage is that pages are refcounted * (which can be slower and simply not an option for some PFNMAP users). The * advantage is that we don't have to follow the strict linearity rule of * PFNMAP mappings in order to support COWable mappings. Which again does not describe this case. > device-dax uses these interfaces and as far as I can see it also only > contains struct pages (or at least dev_dax_huge_fault() calls > pfn_to_page() on every page when VM_FAULT_NOPAGE happens). hacky hacky :) I think DAX probably did it that way for the same reason you are doing it that way - no other choice without changing something Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices
On Wed, Sep 29, 2021 at 05:28:38PM -0600, Logan Gunthorpe wrote:
>
> On 2021-09-29 5:21 p.m., Jason Gunthorpe wrote:
> > On Wed, Sep 29, 2021 at 03:50:02PM -0600, Logan Gunthorpe wrote:
> >>
> >> On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote:
> >>> On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote:
> >>>> Hi,
> >>>>
> >>>> This patchset continues my work to add userspace P2PDMA access using
> >>>> O_DIRECT NVMe devices. My last posting[1] just included the first 13
> >>>> patches in this series, but the early P2PDMA cleanup and map_sg error
> >>>> changes from that series have been merged into v5.15-rc1. To address
> >>>> concerns that that series did not add any new functionality, I've added
> >>>> back the userspace functionality from the original RFC[2] (but improved
> >>>> based on the original feedback).
> >>>
> >>> I really think this is the best series yet, it really looks nice
> >>> overall. I know the sg flag was a bit of a debate at the start, but it
> >>> serves an undeniable purpose and the resulting standard DMA APIs 'just
> >>> working' is really clean.
> >>
> >> Actually, so far, nobody has said anything negative about using the SG
> >> flag.
> >>
> >>> There is more possible here, we could also pass the new GUP flag in the
> >>> ib_umem code..
> >>
> >> Yes, that would be very useful.
> >
> > You might actually prefer to do that than the bio changes to get the
> > infrastructure merged as it seems less "core"
>
> I'm a little bit more concerned about my patch set growing too large.
> It's already at 20 patches and I think I'll need to add a couple more
> based on the feedback you've already provided. So I'm leaning toward
> pushing more functionality as future work.

I mean you could postpone the three block related patches and use a
single ib_umem patch instead as the consumer.

Jason
Re: [PATCH v3 4/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
On Wed, Sep 29, 2021 at 05:00:43PM -0600, Logan Gunthorpe wrote:
>
> On 2021-09-29 4:46 p.m., Jason Gunthorpe wrote:
> > On Wed, Sep 29, 2021 at 03:30:42PM -0600, Logan Gunthorpe wrote:
> >> On 2021-09-28 4:05 p.m., Jason Gunthorpe wrote:
> >> No, that's not a correct reading of the code. Every time there is a new
> >> pagemap, this code calculates the mapping type and bus offset. If a page
> >> comes along with a different pgmap, it recalculates. This just
> >> reduces the overhead so that the calculation is done only every time a
> >> page with a different pgmap comes along and not doing it for every
> >> single page.
> >
> > Each 'struct scatterlist *sg' refers to a range of contiguous pfns
> > starting at page_to_pfn(sg_page()) and going for approx sg->length/PAGE_SIZE
> > pfns long.
>
> Ugh, right. A bit contrived for consecutive pages to have different
> pgmaps and still be next to each other in a DMA transaction. But I guess
> it is technically possible and should be protected against.

I worry it is something a hostile userspace could cook up using mmap and
cause some kind of kernel integrity problem with.

> > @@ -470,7 +470,8 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
> >
> >  	/* Merge contiguous pages into the last SG */
> >  	prv_len = sgt_append->prv->length;
> > -	while (n_pages && page_to_pfn(pages[0]) == paddr) {
> > +	while (n_pages && page_to_pfn(pages[0]) == paddr &&
> > +	       sg_page(sgt_append->prv)->pgmap == pages[0]->pgmap) {
>
> I don't think it's correct to use pgmap without first checking if it is
> a zone device page. But your point is taken. I'll try to address this.

Yes

Jason
Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA
On Wed, Sep 29, 2021 at 03:57:20PM -0700, Jacob Pan wrote:
> Hi Jason,
>
> On Wed, 29 Sep 2021 16:39:53 -0300, Jason Gunthorpe wrote:
>
> > On Wed, Sep 29, 2021 at 12:37:19PM -0700, Jacob Pan wrote:
> >
> > > For #2, it seems we can store the kernel PASID in struct device. This
> > > will preserve the DMA API interface while making it PASID capable.
> > > Essentially, each PASID capable device would have two special global
> > > PASIDs:
> > > 	- PASID 0 for DMA request w/o PASID, aka RID2PASID
> > > 	- PASID 1 (randomly selected) for in-kernel DMA request w/ PASID
> >
> > This seems reasonable, I had the same thought. Basically just have the
> > driver issue some trivial call:
> > 	pci_enable_pasid_dma(pdev, &pasid)
>
> That would work, but I guess it needs to be an iommu_ call instead of pci_?

Whichever makes sense.. The API should take in a struct pci_device and
return a PCI PASID - at least as a wrapper around a more generic iommu
API.

> I think your suggestion is more precise, in case the driver does not want
> to do DMA w/ PASID, we can do less IOTLB flush (PASID 0 only).

Since it is odd, and it may create overhead, I would do it only when
asked to do it.

> > Having multiple RID's pointing at the same IO page table is something
> > we expect iommufd to require so the whole thing should ideally fall
> > out naturally.
>
> That would be the equivalent of attaching multiple devices to the same
> IOMMU domain, right?

Effectively..

Jason
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On 2021-09-29 5:35 p.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 05:27:22PM -0600, Logan Gunthorpe wrote:
>
>>> finish_fault() should set the pte_devmap - eg by passing the
>>> PFN_DEV|PFN_MAP somehow through the vma->vm_page_prot to mk_pte() or
>>> otherwise signaling do_set_pte() that it should set those PTE bits
>>> when it creates the entry.
>>>
>>> (or there should be a vmf_* helper for this special case, but using
>>> the vmf->page seems righter to me)
>>
>> I'm not opposed to this. Though I'm not sure what's best here.
>>
>>>> If we don't set pte_devmap(), then every single page that GUP
>>>> processes needs to check if it's a ZONE_DEVICE page and also if it's
>>>> a P2PDMA page (thus dereferencing pgmap) in order to satisfy the
>>>> requirements of FOLL_PCI_P2PDMA.
>>>
>>> Definitely not suggesting not to set pte_devmap(), only that
>>> VM_MIXEDMAP should not be set on VMAs that only contain struct
>>> pages. That is an abuse of what it is intended for.
>>>
>>> At the very least there should be a big comment above the usage
>>> explaining that this is just working around a limitation in
>>> finish_fault() where it cannot set the PFN_DEV|PFN_MAP bits today.
>>
>> Is it? Documentation on vmf_insert_mixed() and VM_MIXEDMAP is not good
>> and the intention is not clear. I got the impression that mm people
>> wanted those interfaces used for users of pte_devmap().
>
> I thought VM_MIXEDMAP was quite clear:
>
> #define VM_MIXEDMAP	0x1000	/* Can contain "struct page" and pure
> PFN pages */
>
> This VMA does not include PFN pages, so it should not be tagged
> VM_MIXEDMAP.
>
> Aside from enabling the special vmf_ API, it only controls some
> special behavior in vm_normal_page:
>
>  * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
>  * page" backing, however the difference is that _all_ pages with a struct
>  * page (that is, those where pfn_valid is true) are refcounted and considered
>  * normal pages by the VM. The disadvantage is that pages are refcounted
>  * (which can be slower and simply not an option for some PFNMAP users). The
>  * advantage is that we don't have to follow the strict linearity rule of
>  * PFNMAP mappings in order to support COWable mappings.
>
> Which again does not describe this case.

Some of this seems out of date. Pretty sure the pages are not refcounted
with vmf_insert_mixed() and vmf_insert_mixed() is currently the only way
to use VM_MIXEDMAP mappings.

>> device-dax uses these interfaces and as far as I can see it also only
>> contains struct pages (or at least dev_dax_huge_fault() calls
>> pfn_to_page() on every page when VM_FAULT_NOPAGE happens).
>
> hacky hacky :)
>
> I think DAX probably did it that way for the same reason you are
> doing it that way - no other choice without changing something

Sure, but if you look at other vmf_insert_mixed() users (of which there
are few) you see similar patterns. Seems more like it was documented
with one thing in mind but then used in a completely different manner.
Which is why I suggested the documentation was not so good.

Logan
Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices
On 2021-09-29 5:36 p.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 05:28:38PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2021-09-29 5:21 p.m., Jason Gunthorpe wrote:
>>> On Wed, Sep 29, 2021 at 03:50:02PM -0600, Logan Gunthorpe wrote:
>>>>
>>>> On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote:
>>>>> On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This patchset continues my work to add userspace P2PDMA access using
>>>>>> O_DIRECT NVMe devices. My last posting[1] just included the first 13
>>>>>> patches in this series, but the early P2PDMA cleanup and map_sg error
>>>>>> changes from that series have been merged into v5.15-rc1. To address
>>>>>> concerns that that series did not add any new functionality, I've added
>>>>>> back the userspace functionality from the original RFC[2] (but improved
>>>>>> based on the original feedback).
>>>>>
>>>>> I really think this is the best series yet, it really looks nice
>>>>> overall. I know the sg flag was a bit of a debate at the start, but it
>>>>> serves an undeniable purpose and the resulting standard DMA APIs 'just
>>>>> working' is really clean.
>>>>
>>>> Actually, so far, nobody has said anything negative about using the SG
>>>> flag.
>>>>
>>>>> There is more possible here, we could also pass the new GUP flag in the
>>>>> ib_umem code..
>>>>
>>>> Yes, that would be very useful.
>>>
>>> You might actually prefer to do that than the bio changes to get the
>>> infrastructure merged as it seems less "core"
>>
>> I'm a little bit more concerned about my patch set growing too large.
>> It's already at 20 patches and I think I'll need to add a couple more
>> based on the feedback you've already provided. So I'm leaning toward
>> pushing more functionality as future work.
>
> I mean you could postpone the three block related patches and use a
> single ib_umem patch instead as the consumer.

I think that's not a very compelling use case given the only provider of
these VMAs is an NVMe block device. My patch set enables a real world
use (copying data between NVMe devices P2P through the CMB with
O_DIRECT). Being able to read or write a CMB with RDMA and only RDMA is
not very compelling.

Logan
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On Wed, Sep 29, 2021 at 05:49:36PM -0600, Logan Gunthorpe wrote:
> Some of this seems out of date. Pretty sure the pages are not refcounted
> with vmf_insert_mixed() and vmf_insert_mixed() is currently the only way
> to use VM_MIXEDMAP mappings.

Hum. vmf_insert_mixed() boils down to insert_pfn() which always sets
the special bit, so vm_normal_page() returns NULL and thus the pages
are not freed during zap.

So, if the pages are always special and not refcounted all the docs
seem really out of date - or rather they describe the situation without
the special bit, I think.

Why would DAX want to do this in the first place??

This means the address space zap is much more important than just
speeding up destruction, it is essential for correctness since the PTEs
are not holding refcounts naturally... Sigh.

Jason
Re: [PATCH 1/8] iommu/vt-d: Clean up unused PASID updating functions
Hi, Baolu, On Wed, Sep 29, 2021 at 03:34:51PM +0800, Lu Baolu wrote: > On 2021/9/21 3:23, Fenghua Yu wrote: > > update_pasid() and its call chain are currently unused in the tree because > > Thomas disabled the ENQCMD feature. The feature will be re-enabled shortly > > using a different approach and update_pasid() and its call chain will not > > be used in the new approach. > > > > Remove the useless functions. > > > > Signed-off-by: Fenghua Yu > > Reviewed-by: Tony Luck > > Thanks for this cleanup. I have queued it for v5.16. Thank you for pushing this patch to 5.16! -Fenghua ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
Hi, Baolu, On Thu, Sep 23, 2021 at 01:43:32PM +0800, Lu Baolu wrote: > On 9/21/21 3:23 AM, Fenghua Yu wrote: > > +void pasid_put(struct task_struct *tsk, struct mm_struct *mm) > > +{ > > + if (!cpu_feature_enabled(X86_FEATURE_ENQCMD)) > > + return; > > + > > + /* > > +* Nothing to do if this task doesn't have a reference to the PASID. > > +*/ > > + if (tsk->has_valid_pasid) { > > + mutex_lock(&pasid_mutex); > > + /* > > +* The PASID's reference was taken during fix up. Release it > > +* now. If the reference count is 0, the PASID is freed. > > +*/ > > + iommu_sva_free_pasid(mm); > > + mutex_unlock(&pasid_mutex); > > + } > > +} > > > > It looks odd that both __fixup_pasid_exception() and pasid_put() are > defined in the vendor IOMMU driver, but get called in the arch/x86 > code. > > Is it feasible to move these two helpers to the files where they are > called? The IA32_PASID MSR fixup and release are not part of the IOMMU > implementation. Sure. The functions will be removed in the next version. And new functions will not be called in IOMMU driver. Thanks. -Fenghua ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 0/2] arm64: retry dropping HAVE_ARCH_PFN_VALID
From: Mike Rapoport Hi, This is a new attempt to drop HAVE_ARCH_PFN_VALID on arm64 and start using the generic implementation of pfn_valid(). The first patch removes the check for pfn_valid() in dma_map_resource() which is required to avoid false positives when there is memory map for MMIO. Anshuman Khandual (1): arm64/mm: drop HAVE_ARCH_PFN_VALID Mike Rapoport (1): dma-mapping: remove bogus test for pfn_valid from dma_map_resource arch/arm64/Kconfig| 1 - arch/arm64/include/asm/page.h | 1 - arch/arm64/mm/init.c | 37 --- kernel/dma/mapping.c | 4 4 files changed, 43 deletions(-) base-commit: 5816b3e6577eaa676ceb00a848f0fd65fe2adc29 -- 2.28.0 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 1/2] dma-mapping: remove bogus test for pfn_valid from dma_map_resource
From: Mike Rapoport

dma_map_resource() uses pfn_valid() to ensure the range is not RAM.
However, pfn_valid() only checks for availability of the memory map for
a PFN but it does not ensure that the PFN is actually backed by RAM.

As dma_map_resource() is the only method in DMA mapping APIs that has
this check, simply drop the pfn_valid() test from dma_map_resource().

Link: https://lore.kernel.org/all/20210824173741.gc...@arm.com/
Signed-off-by: Mike Rapoport
---
 kernel/dma/mapping.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 06fec5547e7c..dda8d8b84a55 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -296,10 +296,6 @@ dma_addr_t dma_map_resource(struct device *dev, phys_addr_t phys_addr,
 	if (WARN_ON_ONCE(!dev->dma_mask))
 		return DMA_MAPPING_ERROR;
 
-	/* Don't allow RAM to be mapped */
-	if (WARN_ON_ONCE(pfn_valid(PHYS_PFN(phys_addr))))
-		return DMA_MAPPING_ERROR;
-
 	if (dma_map_direct(dev, ops))
 		addr = dma_direct_map_resource(dev, phys_addr, size, dir, attrs);
 	else if (ops->map_resource)
-- 
2.28.0
[PATCH 2/2] arm64/mm: drop HAVE_ARCH_PFN_VALID
From: Anshuman Khandual CONFIG_SPARSEMEM_VMEMMAP is now the only available memory model on arm64 platforms and free_unused_memmap() would just return without creating any holes in the memmap mapping. There is no need for any special handling in pfn_valid() and HAVE_ARCH_PFN_VALID can just be dropped. This also moves the pfn upper bits sanity check into generic pfn_valid(). [rppt: rebased on v5.15-rc3] Link: https://lkml.kernel.org/r/1621947349-25421-1-git-send-email-anshuman.khand...@arm.com Signed-off-by: Anshuman Khandual Acked-by: David Hildenbrand Acked-by: Mike Rapoport Cc: Catalin Marinas Cc: Will Deacon Cc: David Hildenbrand Cc: Mike Rapoport Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mike Rapoport --- arch/arm64/Kconfig| 1 - arch/arm64/include/asm/page.h | 1 - arch/arm64/mm/init.c | 37 --- 3 files changed, 39 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 5c7ae4c3954b..53dd23f305be 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -154,7 +154,6 @@ config ARM64 select HAVE_ARCH_KGDB select HAVE_ARCH_MMAP_RND_BITS select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT - select HAVE_ARCH_PFN_VALID select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET select HAVE_ARCH_SECCOMP_FILTER diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h index f98c91bbd7c1..993a27ea6f54 100644 --- a/arch/arm64/include/asm/page.h +++ b/arch/arm64/include/asm/page.h @@ -41,7 +41,6 @@ void tag_clear_highpage(struct page *to); typedef struct page *pgtable_t; -int pfn_valid(unsigned long pfn); int pfn_is_map_memory(unsigned long pfn); #include diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 37a81754d9b6..e60c6eb813b7 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -160,43 +160,6 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max) free_area_init(max_zone_pfns); } -int pfn_valid(unsigned long pfn) -{ - phys_addr_t addr = 
PFN_PHYS(pfn); - struct mem_section *ms; - - /* -* Ensure the upper PAGE_SHIFT bits are clear in the -* pfn. Else it might lead to false positives when -* some of the upper bits are set, but the lower bits -* match a valid pfn. -*/ - if (PHYS_PFN(addr) != pfn) - return 0; - - if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) - return 0; - - ms = __pfn_to_section(pfn); - if (!valid_section(ms)) - return 0; - - /* -* ZONE_DEVICE memory does not have the memblock entries. -* memblock_is_map_memory() check for ZONE_DEVICE based -* addresses will always fail. Even the normal hotplugged -* memory will never have MEMBLOCK_NOMAP flag set in their -* memblock entries. Skip memblock search for all non early -* memory sections covering all of hotplug memory including -* both normal and ZONE_DEVICE based. -*/ - if (!early_section(ms)) - return pfn_section_valid(ms, pfn); - - return memblock_is_memory(addr); -} -EXPORT_SYMBOL(pfn_valid); - int pfn_is_map_memory(unsigned long pfn) { phys_addr_t addr = PFN_PHYS(pfn); -- 2.28.0 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
On Wed, Sep 29, 2021 at 09:22:30AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 12:46:14PM +1000, da...@gibson.dropbear.id.au wrote:
> > On Tue, Sep 21, 2021 at 10:00:14PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Sep 22, 2021 at 12:54:02AM +, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe
> > > > > Sent: Wednesday, September 22, 2021 12:01 AM
> > > > >
> > > > > > One open about how to organize the device nodes under /dev/vfio/devices/.
> > > > > > This RFC adopts a simple policy by keeping a flat layout with mixed devname
> > > > > > from all kinds of devices. The prerequisite of this model is that devnames
> > > > > > from different bus types are unique formats:
> > > > >
> > > > > This isn't reliable, the devname should just be vfio0, vfio1, etc
> > > > >
> > > > > The userspace can learn the correct major/minor by inspecting the
> > > > > sysfs.
> > > > >
> > > > > This whole concept should disappear into the prior patch that adds the
> > > > > struct device in the first place, and I think most of the code here
> > > > > can be deleted once the struct device is used properly.
> > > >
> > > > Can you help elaborate above flow? This is one area where we need
> > > > more guidance.
> > > >
> > > > When Qemu accepts an option "-device vfio-pci,host=:BB:DD.F",
> > > > how does Qemu identify which vfio0/1/... is associated with the
> > > > specified :BB:DD.F?
> > >
> > > When done properly in the kernel the file:
> > >
> > > /sys/bus/pci/devices/:BB:DD.F/vfio/vfioX/dev
> > >
> > > Will contain the major:minor of the VFIO device.
> > >
> > > Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> > > that the major:minor matches.
> > >
> > > in the above pattern "pci" and ":BB:DD.FF" are the arguments passed
> > > to qemu.
> >
> > I thought part of the appeal of the device centric model was less
> > grovelling around in sysfs for information. Using type/address
> > directly in /dev seems simpler than having to dig around matching
> > things here.
>
> I would say more regular grovelling. Starting from a sysfs device
> directory and querying the VFIO cdev associated with it is much more
> normal than what happens today, which also includes passing sysfs
> information into an ioctl :\

Hm.. ok. Clearly I'm unfamiliar with the things that do that. Other
than current VFIO, the only model I've really seen is where you just
point your program at a device node.

> > Note that this doesn't have to be done in kernel: you could have the
> > kernel just call them /dev/vfio/devices/vfio0, ... but add udev rules
> > that create symlinks from say /dev/vfio/pci/:BB:SS.F ->
> > ../devices/vfioXX based on the sysfs information.
>
> This is the right approach if people want to do this, but I'm not sure
> it is worth it given backwards compat requires the sysfs path as
> input.

You mean for userspace that needs to be able to go back to the old
VFIO interface as well? It seems silly to force this sysfs mucking
about on new programs that depend on the new interface.

> We may as well stick with sysfs as the command line interface
> for userspace tools.
> And I certainly don't want to see userspace tools trying to reverse a
> sysfs path into a /dev/ symlink name when they can directly and
> reliably learn the correct cdev from the sysfspath.

Um.. sure.. but they can get the correct cdev from the sysfspath no
matter how we name the cdevs.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
On Wed, Sep 29, 2021 at 01:05:21PM -0600, Alex Williamson wrote: > On Wed, 29 Sep 2021 12:08:59 +1000 > David Gibson wrote: > > > On Sun, Sep 19, 2021 at 02:38:30PM +0800, Liu Yi L wrote: > > > This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for > > > userspace to directly open a vfio device w/o relying on container/group > > > (/dev/vfio/$GROUP). Anything related to group is now hidden behind > > > iommufd (more specifically in iommu core by this RFC) in a device-centric > > > manner. > > > > > > In case a device is exposed in both legacy and new interfaces (see next > > > patch for how to decide it), this patch also ensures that when the device > > > is already opened via one interface then the other one must be blocked. > > > > > > Signed-off-by: Liu Yi L > > [snip] > > > > > +static bool vfio_device_in_container(struct vfio_device *device) > > > +{ > > > + return !!(device->group && device->group->container); > > > > You don't need !! here. && is already a logical operation, so returns > > a valid bool. 
> > > > > +} > > > + > > > static int vfio_device_fops_release(struct inode *inode, struct file > > > *filep) > > > { > > > struct vfio_device *device = filep->private_data; > > > @@ -1560,7 +1691,16 @@ static int vfio_device_fops_release(struct inode > > > *inode, struct file *filep) > > > > > > module_put(device->dev->driver->owner); > > > > > > - vfio_group_try_dissolve_container(device->group); > > > + if (vfio_device_in_container(device)) { > > > + vfio_group_try_dissolve_container(device->group); > > > + } else { > > > + atomic_dec(&device->opened); > > > + if (device->group) { > > > + mutex_lock(&device->group->opened_lock); > > > + device->group->opened--; > > > + mutex_unlock(&device->group->opened_lock); > > > + } > > > + } > > > > > > vfio_device_put(device); > > > > > > @@ -1613,6 +1753,7 @@ static int vfio_device_fops_mmap(struct file > > > *filep, struct vm_area_struct *vma) > > > > > > static const struct file_operations vfio_device_fops = { > > > .owner = THIS_MODULE, > > > + .open = vfio_device_fops_open, > > > .release= vfio_device_fops_release, > > > .read = vfio_device_fops_read, > > > .write = vfio_device_fops_write, > > > @@ -2295,6 +2436,52 @@ static struct miscdevice vfio_dev = { > > > .mode = S_IRUGO | S_IWUGO, > > > }; > > > > > > +static char *vfio_device_devnode(struct device *dev, umode_t *mode) > > > +{ > > > + return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev)); > > > > Others have pointed out some problems with the use of dev_name() > > here. I'll add that I think you'll make things much easier if instead > > of using one huge "devices" subdir, you use a separate subdir for each > > vfio sub-driver (so, one for PCI, one for each type of mdev, one for > > platform, etc.). That should make avoiding name conflicts a lot simpler. > > It seems like this is unnecessary if we use the vfioX naming approach. 
> Conflicts are trivial to ignore if we don't involve dev_name() and > looking for the correct major:minor chardev in the correct subdirectory > seems like a hassle for userspace. Thanks, Right.. it does sound like a hassle, but AFAICT that's *more* necessary with /dev/vfio/vfioXX than with /dev/vfio/pci/:BB:SS.F, since you have to look up a meaningful name in sysfs to find the right devnode. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v8 09/12] media: mtk-vcodec: Get rid of mtk_smi_larb_get/put
Hi Dafna, Thanks very much for the review. On Wed, 2021-09-29 at 14:13 +0200, Dafna Hirschfeld wrote: > > On 29.09.21 03:37, Yong Wu wrote: > > MediaTek IOMMU has already added the device_link between the > > consumer > > and smi-larb device. If the vcodec device call the > > pm_runtime_get_sync, > > the smi-larb's pm_runtime_get_sync also be called automatically. > > > > CC: Tiffany Lin > > CC: Irui Wang > > Signed-off-by: Yong Wu > > Reviewed-by: Evan Green > > Acked-by: Tiffany Lin > > Reviewed-by: Dafna Hirschfeld > > --- > > .../platform/mtk-vcodec/mtk_vcodec_dec_pm.c | 37 +++--- > > -- > > .../platform/mtk-vcodec/mtk_vcodec_drv.h | 3 -- > > .../platform/mtk-vcodec/mtk_vcodec_enc.c | 1 - > > .../platform/mtk-vcodec/mtk_vcodec_enc_pm.c | 44 +++--- > > - > > 4 files changed, 10 insertions(+), 75 deletions(-) > > > > diff --git a/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c > > b/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c > > index 6038db96f71c..d0bf9aa3b29d 100644 > > --- a/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c > > +++ b/drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c > > @@ -8,14 +8,12 @@ > > #include > > #include > > #include > > -#include > > > > #include "mtk_vcodec_dec_pm.h" > > #include "mtk_vcodec_util.h" > > > > int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev *mtkdev) > > { > > - struct device_node *node; > > struct platform_device *pdev; > > struct mtk_vcodec_pm *pm; > > struct mtk_vcodec_clk *dec_clk; > > @@ -26,18 +24,7 @@ int mtk_vcodec_init_dec_pm(struct mtk_vcodec_dev > > *mtkdev) > > pm = &mtkdev->pm; > > pm->mtkdev = mtkdev; > > dec_clk = &pm->vdec_clk; > > - node = of_parse_phandle(pdev->dev.of_node, "mediatek,larb", 0); > > - if (!node) { > > - mtk_v4l2_err("of_parse_phandle mediatek,larb fail!"); > > - return -1; > > - } > > > > - pdev = of_find_device_by_node(node); > > - of_node_put(node); > > - if (WARN_ON(!pdev)) { > > - return -1; > > - } > > - pm->larbvdec = &pdev->dev; > > pdev = 
mtkdev->plat_dev; > > pm->dev = &pdev->dev; > > > > @@ -47,14 +34,11 @@ int mtk_vcodec_init_dec_pm(struct > > mtk_vcodec_dev *mtkdev) > > dec_clk->clk_info = devm_kcalloc(&pdev->dev, > > dec_clk->clk_num, sizeof(*clk_info), > > GFP_KERNEL); > > - if (!dec_clk->clk_info) { > > - ret = -ENOMEM; > > - goto put_device; > > - } > > + if (!dec_clk->clk_info) > > + return -ENOMEM; > > } else { > > mtk_v4l2_err("Failed to get vdec clock count"); > > - ret = -EINVAL; > > - goto put_device; > > + return -EINVAL; > > } > > > > for (i = 0; i < dec_clk->clk_num; i++) { > > @@ -63,29 +47,24 @@ int mtk_vcodec_init_dec_pm(struct > > mtk_vcodec_dev *mtkdev) > > "clock-names", i, &clk_info->clk_name); > > if (ret) { > > mtk_v4l2_err("Failed to get clock name id = > > %d", i); > > - goto put_device; > > + return ret; > > } > > clk_info->vcodec_clk = devm_clk_get(&pdev->dev, > > clk_info->clk_name); > > if (IS_ERR(clk_info->vcodec_clk)) { > > mtk_v4l2_err("devm_clk_get (%d)%s fail", i, > > clk_info->clk_name); > > - ret = PTR_ERR(clk_info->vcodec_clk); > > - goto put_device; > > + return PTR_ERR(clk_info->vcodec_clk); > > } > > } > > > > pm_runtime_enable(&pdev->dev); > > return 0; > > -put_device: > > - put_device(pm->larbvdec); > > - return ret; > > } > > > > void mtk_vcodec_release_dec_pm(struct mtk_vcodec_dev *dev) > > { > > pm_runtime_disable(dev->pm.dev); > > - put_device(dev->pm.larbvdec); > > } > > Now that functions only do 'pm_runtime_disable(dev->pm.dev);' so it > will be more > readable to remove the function mtk_vcodec_release_dec_pm > and replace with pm_runtime_disable(dev->pm.dev); > Same for the 'enc' equivalent. Make sense. But It may be not proper if using pm_runtime_disable as the symmetry with mtk_vcodec_init_dec_pm in the mtk_vcodec_probe. Maybe we should move pm_runtime_enable out from mtk_vcodec_init_dec_pm into mtk_vcodec_probe. I could do a new patch for this. Is this ok for you? 
> > Thanks, > Dafna [snip] ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH kernel] powerpc/iommu: Report the correct most efficient DMA mask for PCI devices
According to dma-api.rst, the dma_get_required_mask() helper should return "the mask that the platform requires to operate efficiently". Which in the case of PPC64 means the bypass mask and not a mask from an IOMMU table which is shorter and slower to use due to map/unmap operations (especially expensive on "pseries"). However the existing implementation ignores the possibility of bypassing and returns the IOMMU table mask on the pseries platform which makes some drivers (mpt3sas is one example) choose 32bit DMA even though bypass is supported. The powernv platform sort of handles it by having a bigger default window with a mask >=40 but it only works as drivers choose 63/64bit if the required mask is >32 which is rather pointless. This reintroduces the bypass capability check to let drivers make a better choice of the DMA mask. Fixes: f1565c24b596 ("powerpc: use the generic dma_ops_bypass mode") Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kernel/dma-iommu.c | 8 1 file changed, 8 insertions(+) diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c index 111249fd619d..d646077bcbcf 100644 --- a/arch/powerpc/kernel/dma-iommu.c +++ b/arch/powerpc/kernel/dma-iommu.c @@ -184,6 +184,14 @@ u64 dma_iommu_get_required_mask(struct device *dev) struct iommu_table *tbl = get_iommu_table_base(dev); u64 mask; + if (dev_is_pci(dev)) { + u64 bypass_mask = dma_direct_get_required_mask(dev); + + if (dma_iommu_dma_supported(dev, bypass_mask)) { + dev_info(dev, "%s: returning bypass mask 0x%llx\n", __func__, bypass_mask); + return bypass_mask; + } + } if (!tbl) return 0; -- 2.30.2 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
On Wed, Sep 29, 2021 at 06:41:00AM +, Tian, Kevin wrote: > > From: David Gibson > > Sent: Wednesday, September 29, 2021 2:01 PM > > > > On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote: > > > This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the > > vfio > > > device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is > > provided > > > because it's implicitly done when the device fd is closed. > > > > > > In concept a vfio device can be bound to multiple iommufds, each hosting > > > a subset of I/O address spaces attached by this device. > > > > I really feel like this many<->many mapping between devices is going > > to be super-confusing, and therefore make it really hard to be > > confident we have all the rules right for proper isolation. > > Based on new discussion on group ownership part (patch06), I feel this > many<->many relationship will disappear. The context fd (either container > or iommufd) will uniquely mark the ownership on a physical device and > its group. With this design it's impractical to have one device bound > to multiple iommufds. Actually I don't think this is a compelling usage > in reality. The previous rationale was that no need to impose such restriction > if no special reason... and now we have a reason. 😊 > > Jason, are you OK with this simplification? > > > > > That's why I was suggesting a concept like endpoints, to break this > > into two many<->one relationships. I'm ok if that isn't visible in > > the user API, but I think this is going to be really hard to keep > > track of if it isn't explicit somewhere in the internals. > > > > I think this endpoint concept is represented by ioas_device_info in > patch14: > > +/* > + * An ioas_device_info object is created per each successful attaching > + * request. A list of objects are maintained per ioas when the address > + * space is shared by multiple devices. 
> + */
> +struct ioas_device_info {
> +	struct iommufd_device *idev;
> +	struct list_head next;
> +};
>
> currently it's a 1:1 mapping between this object and iommufd_device,
> because there is no pasid support yet.

Ok, I haven't read that far in the series yet.

> We can rename it to struct ioas_endpoint if it makes you feel
> better.

Meh. The concept is much more important than the name.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
_way_ _around_!
http://www.ozlabs.org/~dgibson

signature.asc
Description: PGP signature
Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
On Wed, Sep 29, 2021 at 09:57:16AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 04:35:19PM +1000, David Gibson wrote:
>
> > Yes, exactly. And with a group interface it's obvious it has to
> > understand it. With the non-group interface, you can get to this
> > stage in ignorance of groups. It will even work as long as you are
> > lucky enough only to try with singleton-group devices. Then you try
> > it with two devices in the one group and doing (3) on device A will
> > implicitly change the DMA environment of device B.
>
> The security model here says this is fine.

I'm not making a statement about the security model, I'm making a statement about the surprisingness of the programming interface. In your program you have devices A & B; you perform an operation that specifies only device A, and device B changes behaviour.

> This idea to put the iommu code in charge of security is quite clean,
> as I said in the other mail: drivers attached to 'struct device *'
> tell the iommu layer what they are doing:
>
>    iommu_set_device_dma_owner(dev, DMA_OWNER_KERNEL, NULL)
>    iommu_set_device_dma_owner(dev, DMA_OWNER_SHARED, NULL)
>    iommu_set_device_dma_owner(dev, DMA_OWNER_USERSPACE, group_file/iommu_file)
>
> And it decides if it is allowed.
>
> If device A is allowed to go to userspace then security wise it is
> deemed fine that B is impacted. That is what we have defined already
> today.
>
> This proposal does not free userspace from having to understand this!
> The iommu_group sysfs is still there and still must be understood.
>
> The *admin* is the one responsible to understand the groups, not the
> applications. The admin has no idea what a group FD is - they should
> be looking at the sysfs and seeing the iommu_group directories.

Not just the admin. If an app is given two devices in the same group, then to use *both* it must understand that and act accordingly.
Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()
On Wed, Sep 29, 2021 at 09:24:57AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote:
>
> > > +struct iommufd_device {
> > > +	unsigned int id;
> > > +	struct iommufd_ctx *ictx;
> > > +	struct device *dev; /* always be the physical device */
> > > +	u64 dev_cookie;
> >
> > Why do you need both an 'id' and a 'dev_cookie'? Since they're both
> > unique, couldn't you just use the cookie directly as the index into
> > the xarray?
>
> ID is the kernel value in the xarray - the xarray is much more efficient &
> safe with small kernel-controlled values.
>
> dev_cookie is a user-assigned value that may not be unique. Its
> purpose is to allow userspace to receive an event and go back to its
> structure. Most likely userspace will store a pointer here, but it is
> also possible userspace could not use it.
>
> It is a pretty normal pattern

Hm, ok. Could you point me at an example?

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
_way_ _around_!
http://www.ozlabs.org/~dgibson

signature.asc
Description: PGP signature
Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
On Wed, Sep 29, 2021 at 07:31:08AM +, Tian, Kevin wrote: > > From: David Gibson > > Sent: Wednesday, September 29, 2021 2:35 PM > > > > On Wed, Sep 29, 2021 at 05:38:56AM +, Tian, Kevin wrote: > > > > From: David Gibson > > > > Sent: Wednesday, September 29, 2021 12:56 PM > > > > > > > > > > > > > > Unlike vfio, iommufd adopts a device-centric design with all group > > > > > logistics hidden behind the fd. Binding a device to iommufd serves > > > > > as the contract to get security context established (and vice versa > > > > > for unbinding). One additional requirement in iommufd is to manage > > the > > > > > switch between multiple security contexts due to decoupled > > bind/attach: > > > > > > > > > > 1) Open a device in "/dev/vfio/devices" with user access blocked; > > > > > > > > Probably worth clarifying that (1) must happen for *all* devices in > > > > the group before (2) happens for any device in the group. > > > > > > No. User access is naturally blocked for other devices as long as they > > > are not opened yet. > > > > Uh... my point is that everything in the group has to be removed from > > regular kernel drivers before we reach step (2). Is the plan that you > > must do that before you can even open them? That's a reasonable > > choice, but then I think you should show that step in this description > > as well. > > Agree. I think below proposal can meet above requirement and ensure > it's not broken in the whole process when the group is operated by the > userspace: > > https://lore.kernel.org/kvm/20210928140712.gl964...@nvidia.com/ > > and definitely an updated description will be provided when sending out > the new proposal. 
> > > > > 2) Bind the device to an iommufd with an initial security context
> > > > > (an empty iommu domain which blocks dma) established for its
> > > > > group, with user access unblocked;
> > > > >
> > > > > 3) Attach the device to a user-specified ioasid (shared by all
> > > > > devices attached to this ioasid). Before attaching, the device
> > > > > should be first detached from the initial context;
> > > >
> > > > So, this step can implicitly but observably change the behaviour for
> > > > other devices in the group as well. I don't love that kind of
> > > > difficult to predict side effect, which is why I'm *still* not totally
> > > > convinced by the device-centric model.
> > >
> > > which side-effect is predicted here? The user anyway needs to be
> > > aware of such group restriction regardless whether it uses group
> > > or nongroup interface.
> >
> > Yes, exactly. And with a group interface it's obvious it has to
> > understand it. With the non-group interface, you can get to this
> > stage in ignorance of groups. It will even work as long as you are
> > lucky enough only to try with singleton-group devices. Then you try
> > it with two devices in the one group and doing (3) on device A will
> > implicitly change the DMA environment of device B.
>
> for non-group we can also document it obviously in uAPI that the user
> must understand the group restriction, and that violating it will get a
> failure when attaching devices in the same group to different IOASes.

Documenting limitations is always inferior to building them into the actual API signatures. Sometimes it's the only option, but people frequently don't read the docs, whereas they kind of have to look at the API itself.
Re: [PATCH RFC v1 01/11] uapi/virtio-iommu: Add page request grp-id and flags information
Hi Jean,

On 9/21/21 9:28 PM, Jean-Philippe Brucker wrote:

Hi Vivek,

Thanks a lot for your work on this

Thanks a lot for taking a look at it. I hope that most of the changes in this series and the nested page table series [1] are independent of the presently on-going /dev/iommu proposal, and can be reviewed separately.

Please find my comments inline below.

On Fri, Apr 23, 2021 at 03:21:37PM +0530, Vivek Gautam wrote:

Add fault information for group-id and the necessary flags for page request faults that can be handled by the page fault handler in the virtio-iommu driver.

Signed-off-by: Vivek Gautam
Cc: Joerg Roedel
Cc: Will Deacon
Cc: Robin Murphy
Cc: Jean-Philippe Brucker
Cc: Eric Auger
Cc: Alex Williamson
Cc: Kevin Tian
Cc: Jacob Pan
Cc: Liu Yi L
Cc: Lorenzo Pieralisi
Cc: Shameerali Kolothum Thodi
---
 include/uapi/linux/virtio_iommu.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h
index f8bf927a0689..accc3318ce46 100644
--- a/include/uapi/linux/virtio_iommu.h
+++ b/include/uapi/linux/virtio_iommu.h
@@ -307,14 +307,27 @@ struct virtio_iommu_req_invalidate {
 #define VIRTIO_IOMMU_FAULT_F_DMA_UNRECOV	1
 #define VIRTIO_IOMMU_FAULT_F_PAGE_REQ		2

+#define VIRTIO_IOMMU_FAULT_PRQ_F_PASID_VALID	(1 << 0)
+#define VIRTIO_IOMMU_FAULT_PRQ_F_LAST_PAGE	(1 << 1)
+#define VIRTIO_IOMMU_FAULT_PRQ_F_PRIV_DATA	(1 << 2)
+#define VIRTIO_IOMMU_FAULT_PRQ_F_NEEDS_PASID	(1 << 3)

I don't think this one is necessary here. The NEEDS_PASID flag added by commit 970471914c67 ("iommu: Allow page responses without PASID") mainly helps Linux keep track of things internally. It does tell the fault handler whether to reply with PASID or not, but we don't need that here. The virtio-iommu driver knows whether a PASID is required by looking at the "PRG Response PASID Required" bit in the PCIe capability. For non-PCIe faults (e.g.
SMMU stall), I'm guessing we'll need a PROBE property to declare that the endpoint supports recoverable faults anyway, so "PASID required in response" can go through there as well.

Sure, I will remove this flag, and rather read the PCIe cap to find out if PASID is required or not. After this series, I will follow up with the non-PCIe fault handling.

+
+#define VIRTIO_IOMMU_FAULT_UNREC_F_PASID_VALID		(1 << 0)
+#define VIRTIO_IOMMU_FAULT_UNREC_F_ADDR_VALID		(1 << 1)
+#define VIRTIO_IOMMU_FAULT_UNREC_F_FETCH_ADDR_VALID	(1 << 2)
+
 struct virtio_iommu_fault {
 	__u8	reason;
 	__u8	reserved[3];
 	__le16	flt_type;
 	__u8	reserved2[2];
+	/* flags is actually permission flags */

It's also used for declaring validity of fields. VIRTIO_IOMMU_FAULT_F_ADDRESS already tells whether the address field is valid, so all the other flags introduced by this patch can go in here.

Sure, will remove the pr_evt_flags field, and move all the flags to 'flags'.

 	__le32	flags;
+	/* flags for PASID and Page request handling info */
+	__le32	pr_evt_flags;
 	__le32	endpoint;
 	__le32	pasid;
+	__le32	grpid;

I'm not sure why we made it 32-bit in the Linux UAPI, it's a little wasteful. PCIe PRGI is 9 bits and SMMU STAG is 16 bits. Since the scope of the grpid is the endpoint, 16 bits means 64k in-flight faults per endpoint, which seems more than enough.

Right, I will update this to a 16-bit field. It won't be okay to update the iommu uAPI now, right?

New fields must be appended at the end of the struct, because old drivers will expect to find the 'endpoint' field at this offset. You could remove 'reserved3' while adding 'grpid', to keep the struct layout.

Sure, will update this.
 	__u8	reserved3[4];
 	__le64	address;
 	__u8	reserved4[8];

So the base structure, currently in the spec, looks like this:

struct virtio_iommu_fault {
	u8	reason;
	u8	reserved[3];
	le32	flags;
	le32	endpoint;
	le32	reserved1;
	le64	address;
};

#define VIRTIO_IOMMU_FAULT_F_READ	(1 << 0)
#define VIRTIO_IOMMU_FAULT_F_WRITE	(1 << 1)
#define VIRTIO_IOMMU_FAULT_F_ADDRESS	(1 << 8)

The extended struct could be:

struct virtio_iommu_fault {
	u8	reason;
	u8	reserved[3];
	le32	flags;
	le32	endpoint;
	le32	pasid;
	le64	address;
	/* Page req
Re: [PATCH kernel] powerpc/iommu: Report the correct most efficient DMA mask for PCI devices
On Thu, Sep 30, 2021 at 01:44:54PM +1000, Alexey Kardashevskiy wrote:
> and returns the IOMMU table mask on the pseries platform which makes some
> drivers (mpt3sas is one example) choose 32bit DMA even though bypass is
> supported. The powernv platform sort of handles it by having a bigger
> default window with a mask >=40 but it only works as drivers choose
> 63/64bit if the required mask is >32 which is rather pointless.
>
> This reintroduces the bypass capability check to let drivers make
> a better choice of the DMA mask.
>
> Fixes: f1565c24b596 ("powerpc: use the generic dma_ops_bypass mode")
> Signed-off-by: Alexey Kardashevskiy

Looks good:

Reviewed-by: Christoph Hellwig
Re: [PATCH 1/2] dma-mapping: remove bogus test for pfn_valid from dma_map_resource
On Thu, Sep 30, 2021 at 04:30:38AM +0300, Mike Rapoport wrote:
> From: Mike Rapoport
>
> dma_map_resource() uses pfn_valid() to ensure the range is not RAM.
> However, pfn_valid() only checks for availability of the memory map for a
> PFN but it does not ensure that the PFN is actually backed by RAM.
>
> As dma_map_resource() is the only method in DMA mapping APIs that has this
> check, simply drop the pfn_valid() test from dma_map_resource().
>
> Link: https://lore.kernel.org/all/20210824173741.gc...@arm.com/
> Signed-off-by: Mike Rapoport

We'll lose a bit of a sanity check with this, but given the problems it caused:

Reviewed-by: Christoph Hellwig
Re: [PATCH V5 12/12] net: netvsc: Add Isolation VM support for netvsc driver
On Tue, Sep 28, 2021 at 05:23:31PM +0800, Tianyu Lan wrote:
>>
>> - the bare memremap usage in swiotlb looks strange and I'd
>> definitely expect a well documented wrapper.
>
> OK. Should the wrapper be in the DMA code? How about the dma_map_decrypted()
> introduced in V4?

As mentioned then, the name is a pretty bad choice as it touches the
dma_map_* namespace that it is not related to. I suspect just a little
helper in the swiotlb code that explains how it is used might be enough
for now.