RE: [PATCH] dma-mapping: benchmark: Extract a common header file for map_benchmark definition
> -Original Message-
> From: tiantao (H)
> Sent: Friday, February 11, 2022 4:15 PM
> To: Song Bao Hua (Barry Song) ; sh...@kernel.org; chenxiang (M)
> Cc: iommu@lists.linux-foundation.org; linux-kselft...@vger.kernel.org; linux...@openeuler.org
> Subject: [PATCH] dma-mapping: benchmark: Extract a common header file for map_benchmark definition
>
> kernel/dma/map_benchmark.c and selftests/dma/dma_map_benchmark.c
> have duplicate map_benchmark definitions, which tends to lead to
> inconsistent changes to map_benchmark on both sides. Extract a
> common header file to avoid this problem.
>
> Signed-off-by: Tian Tao

+To: Christoph

Looks like the right cleanup. It will help decrease the maintenance overhead in the future. Other similar selftests tools are also doing this.

Acked-by: Barry Song

> ---
> kernel/dma/map_benchmark.c | 24 +-
> kernel/dma/map_benchmark.h | 31 +++
> .../testing/selftests/dma/dma_map_benchmark.c | 25 +--
> 3 files changed, 33 insertions(+), 47 deletions(-)
> create mode 100644 kernel/dma/map_benchmark.h
>
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index 9b9af1bd6be3..c05f4e242991 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -18,29 +18,7 @@
> #include
> #include
>
> -#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
> -#define DMA_MAP_MAX_THREADS 1024
> -#define DMA_MAP_MAX_SECONDS 300
> -#define DMA_MAP_MAX_TRANS_DELAY (10 * NSEC_PER_MSEC)
> -
> -#define DMA_MAP_BIDIRECTIONAL 0
> -#define DMA_MAP_TO_DEVICE 1
> -#define DMA_MAP_FROM_DEVICE 2
> -
> -struct map_benchmark {
> - __u64 avg_map_100ns; /* average map latency in 100ns */
> - __u64 map_stddev; /* standard deviation of map latency */
> - __u64 avg_unmap_100ns; /* as above */
> - __u64 unmap_stddev;
> - __u32 threads; /* how many threads will do map/unmap in parallel */
> - __u32 seconds; /* how long the test will last */
> - __s32 node; /* which numa node this benchmark will run on */
> - __u32 dma_bits; /* DMA addressing capability */
> - __u32 dma_dir; /* DMA data direction */
> - __u32 dma_trans_ns; /* time for DMA transmission in ns */
> - __u32 granule; /* how many PAGE_SIZE will do map/unmap once a time */
> - __u8 expansion[76]; /* For future use */
> -};
> +#include "map_benchmark.h"
>
> struct map_benchmark_data {
> struct map_benchmark bparam;
>
> diff --git a/kernel/dma/map_benchmark.h b/kernel/dma/map_benchmark.h
> new file mode 100644
> index ..62674c83bde4
> --- /dev/null
> +++ b/kernel/dma/map_benchmark.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2022 HiSilicon Limited.
> + */
> +
> +#ifndef _KERNEL_DMA_BENCHMARK_H
> +#define _KERNEL_DMA_BENCHMARK_H
> +
> +#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
> +#define DMA_MAP_MAX_THREADS 1024
> +#define DMA_MAP_MAX_SECONDS 300
> +#define DMA_MAP_MAX_TRANS_DELAY (10 * NSEC_PER_MSEC)
> +
> +#define DMA_MAP_BIDIRECTIONAL 0
> +#define DMA_MAP_TO_DEVICE 1
> +#define DMA_MAP_FROM_DEVICE 2
> +
> +struct map_benchmark {
> + __u64 avg_map_100ns; /* average map latency in 100ns */
> + __u64 map_stddev; /* standard deviation of map latency */
> + __u64 avg_unmap_100ns; /* as above */
> + __u64 unmap_stddev;
> + __u32 threads; /* how many threads will do map/unmap in parallel */
> + __u32 seconds; /* how long the test will last */
> + __s32 node; /* which numa node this benchmark will run on */
> + __u32 dma_bits; /* DMA addressing capability */
> + __u32 dma_dir; /* DMA data direction */
> + __u32 dma_trans_ns; /* time for DMA transmission in ns */
> + __u32 granule; /* how many PAGE_SIZE will do map/unmap once a time */
> +};
> +#endif /* _KERNEL_DMA_BENCHMARK_H */
>
> diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
> index 485dff51bad2..33bf073071aa 100644
> --- a/tools/testing/selftests/dma/dma_map_benchmark.c
> +++ b/tools/testing/selftests/dma/dma_map_benchmark.c
> @@ -11,39 +11,16 @@
> #include
> #include
> #include
> +#include "../../../../kernel/dma/map_benchmark.h"
>
> #define NSEC_PER_MSEC 1000000L
>
> -#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
> -#define DMA_MAP_MAX_THREADS 1024
> -#define DMA_MAP_MAX_SECONDS 300
> -#define DMA_MAP_MAX_TRANS_DELAY (10 * NSEC_PER_M
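The `DMA_MAP_BENCHMARK` ioctl number in the patch above is built with `_IOWR('d', 1, struct map_benchmark)`. As an illustration of how that macro packs its fields, here is a small Python sketch of the standard asm-generic Linux ioctl encoding; the 128-byte size used at the end is an assumption for the example, not the real `sizeof(struct map_benchmark)`.

```python
# Sketch of the asm-generic Linux ioctl number encoding used by
# _IOWR('d', 1, struct map_benchmark).  Field widths follow
# include/uapi/asm-generic/ioctl.h: nr (8 bits), type (8), size (14),
# direction (2).
_IOC_NRBITS = 8
_IOC_TYPEBITS = 8
_IOC_SIZEBITS = 14

_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = _IOC_NRSHIFT + _IOC_NRBITS        # 8
_IOC_SIZESHIFT = _IOC_TYPESHIFT + _IOC_TYPEBITS    # 16
_IOC_DIRSHIFT = _IOC_SIZESHIFT + _IOC_SIZEBITS     # 30

_IOC_WRITE = 1
_IOC_READ = 2

def _ioc(direction: int, ioc_type: str, nr: int, size: int) -> int:
    return ((direction << _IOC_DIRSHIFT) |
            (ord(ioc_type) << _IOC_TYPESHIFT) |
            (size << _IOC_SIZESHIFT) |
            (nr << _IOC_NRSHIFT))

def iowr(ioc_type: str, nr: int, size: int) -> int:
    # _IOWR: data flows both ways, so READ and WRITE bits are both set.
    return _ioc(_IOC_READ | _IOC_WRITE, ioc_type, nr, size)

# Hypothetical struct size for illustration only; the real value is
# whatever sizeof(struct map_benchmark) comes to with the layout above.
DMA_MAP_BENCHMARK = iowr('d', 1, 128)
```

Sharing one header means both the kernel and the selftest compute this number from the same struct definition, which is exactly the inconsistency the patch removes.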
RE: [PATCH] MAINTAINERS: Update maintainer list of DMA MAPPING BENCHMARK
> -Original Message-
> From: chenxiang (M)
> Sent: Tuesday, February 8, 2022 8:05 PM
> To: Song Bao Hua (Barry Song) ; h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com
> Cc: linux...@openeuler.org; Linuxarm ; iommu@lists.linux-foundation.org; linux-kselft...@vger.kernel.org; chenxiang (M)
> Subject: [PATCH] MAINTAINERS: Update maintainer list of DMA MAPPING BENCHMARK
>
> From: Xiang Chen
>
> Barry Song will not focus on this area, and Xiang Chen will continue his
> work to maintain this module.
>
> Signed-off-by: Xiang Chen

Acked-by: Barry Song

Xiang has been a user of this module and has made substantial contributions not only to this module but also to related modules such as iommu/arm-smmu-v3. This email account of mine will become unreachable this month, and I will probably rarely work on this module afterwards, so I am happy Xiang will take care of it. Thanks!

> ---
> MAINTAINERS | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ea3e6c914384..48335022b0e4 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5765,7 +5765,7 @@ F: include/linux/dma-map-ops.h
> F: kernel/dma/
>
> DMA MAPPING BENCHMARK
> -M: Barry Song
> +M: Xiang Chen
> L: iommu@lists.linux-foundation.org
> F: kernel/dma/map_benchmark.c
> F: tools/testing/selftests/dma/
> --
> 2.33.0

Best Regards,
Barry

___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC PATCH] provide per numa cma with an initial default size
> -Original Message- > From: Robin Murphy [mailto:robin.mur...@arm.com] > Sent: Tuesday, December 7, 2021 4:01 AM > To: Jay Chen ; h...@lst.de; > m.szyprow...@samsung.com; > linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org; Song Bao Hua > (Barry Song) > Cc: zhangligu...@linux.alibaba.com > Subject: Re: [RFC PATCH] provide per numa cma with an initial default size > > [ +Barry ] > > On 2021-11-30 07:45, Jay Chen wrote: > >In the actual production environment, when we open > > cma and per numa cma, if we do not increase the per > > numa size configuration in cmdline, we find that our > > performance has dropped by 20%. > >Through analysis, we found that the default size of > > per numa is 0, which causes the driver to allocate > > memory from cma, which affects performance. Therefore, > > we think we need to provide a default size. > > Looking back at some of the review discussions, I think it may have been > intentional that per-node areas are not allocated by default, since it's > the kind of thing that really wants to be tuned to the particular system > and workload, and as such it seemed reasonable to expect users to > provide a value on the command line if they wanted the feature. That's > certainly what the Kconfig text implies. > > Thanks, > Robin. > > > Signed-off-by: Jay Chen > > --- > > kernel/dma/contiguous.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c > > index 3d63d91cba5c..3bef8bf371d9 100644 > > --- a/kernel/dma/contiguous.c > > +++ b/kernel/dma/contiguous.c > > @@ -99,7 +99,7 @@ early_param("cma", early_cma); > > #ifdef CONFIG_DMA_PERNUMA_CMA > > > > static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES]; > > -static phys_addr_t pernuma_size_bytes __initdata; > > +static phys_addr_t pernuma_size_bytes __initdata = size_bytes; I don't think the size for the default cma can apply to per-numa CMA. 
We did have some discussion regarding the size when per-NUMA CMA was added, and it was done via a Kconfig option. I think we decided not to have any default size other than 0. A default size of 0 is perfect: it forces users to set a proper "cma_pernuma=" bootarg.

> >
> > static int __init early_cma_pernuma(char *p)
> > {
> >

Thanks
Barry
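For reference, with the default of 0 a per-NUMA area is only reserved when sized explicitly on the kernel command line. An illustrative boot-parameter fragment (the sizes are arbitrary examples, and `cma_pernuma=` requires CONFIG_DMA_PERNUMA_CMA=y):

```shell
# Illustrative kernel command line: a 16 MB default CMA area plus a
# 64 MB CMA area per NUMA node (sizes are examples only).
cma=16M cma_pernuma=64M
```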
Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA
On Fri, Oct 8, 2021 at 12:32 AM Jason Gunthorpe wrote:
>
> On Thu, Oct 07, 2021 at 06:43:33PM +1300, Barry Song wrote:
>
> > So do we have a case where devices can directly access the kernel's data
> > structure such as a list/graph/tree with pointers to a kernel virtual address?
> > then devices don't need to translate the address of pointers in a structure.
> > I assume this is one of the most useful features userspace SVA can provide.
>
> AFAICT that is the only good case for KVA, but it is also completely
> against the endianness, word size and DMA portability design of the
> kernel.
>
> Going there requires some new set of portable APIs for globally
> coherent KVA dma.

Yep, I agree. It would be very weird if accelerators/GPUs shared the kernel's data structures, yet for each "DMA" operation, reading or writing the data struct, we still had to call dma_map_single/sg or dma_sync_single_for_cpu/device etc. Once devices and CPUs share a virtual address space (SVA), code shouldn't need to do an explicit map/sync each time.

> Jason

Thanks
Barry
Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA
On Tue, Oct 5, 2021 at 7:21 AM Jason Gunthorpe wrote: > > On Mon, Oct 04, 2021 at 09:40:03AM -0700, Jacob Pan wrote: > > Hi Barry, > > > > On Sat, 2 Oct 2021 01:45:59 +1300, Barry Song <21cn...@gmail.com> wrote: > > > > > > > > > > > I assume KVA mode can avoid this iotlb flush as the device is using > > > > > the page table of the kernel and sharing the whole kernel space. But > > > > > will users be glad to accept this mode? > > > > > > > > You can avoid the lock be identity mapping the physical address space > > > > of the kernel and maping map/unmap a NOP. > > > > > > > > KVA is just a different way to achive this identity map with slightly > > > > different security properties than the normal way, but it doesn't > > > > reach to the same security level as proper map/unmap. > > > > > > > > I'm not sure anyone who cares about DMA security would see value in > > > > the slight difference between KVA and a normal identity map. > > > > > > yes. This is an important question. if users want a high security level, > > > kva might not their choice; if users don't want the security, they are > > > using iommu passthrough. So when will users choose KVA? > > Right, KVAs sit in the middle in terms of performance and security. > > Performance is better than IOVA due to IOTLB flush as you mentioned. Also > > not too far behind of pass-through. > > The IOTLB flush is not on a DMA path but on a vmap path, so it is very > hard to compare the two things.. Maybe vmap can be made to do lazy > IOTLB flush or something and it could be closer > > > Security-wise, KVA respects kernel mapping. So permissions are better > > enforced than pass-through and identity mapping. > > Is this meaningful? Isn't the entire physical map still in the KVA and > isn't it entirely RW ? Some areas are RX, for example, ARCH64 supports KERNEL_TEXT_RDONLY. But the difference is really minor. 
So do we have a case where devices can directly access the kernel's data structures, such as a list/graph/tree with pointers to kernel virtual addresses? Then devices wouldn't need to translate the addresses of pointers inside a structure. I assume this is one of the most useful features userspace SVA can provide. But do we have a case where accelerators/GPUs want to use the complex data structures of kernel drivers?

> Jason

Thanks
Barry
Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA
On Wed, Sep 22, 2021 at 5:14 PM Jacob Pan wrote: > > Hi Joerg/Jason/Christoph et all, > > The current in-kernel supervisor PASID support is based on the SVM/SVA > machinery in sva-lib. Kernel SVA is achieved by extending a special flag > to indicate the binding of the device and a page table should be performed > on init_mm instead of the mm of the current process.Page requests and other > differences between user and kernel SVA are handled as special cases. > > This unrestricted binding with the kernel page table is being challenged > for security and the convention that in-kernel DMA must be compatible with > DMA APIs. > (https://lore.kernel.org/linux-iommu/20210511194726.gp1002...@nvidia.com/) > There is also the lack of IOTLB synchronization upon kernel page table > updates. > > This patchset is trying to address these concerns by having an explicit DMA > API compatible model while continue to support in-kernel use of DMA requests > with PASID. Specifically, the following DMA-IOMMU APIs are introduced: > > int iommu_dma_pasid_enable/disable(struct device *dev, >struct iommu_domain **domain, >enum iommu_dma_pasid_mode mode); > int iommu_map/unmap_kva(struct iommu_domain *domain, > void *cpu_addr,size_t size, int prot); > > The following three addressing modes are supported with example API usages > by device drivers. > > 1. Physical address (bypass) mode. Similar to DMA direct where trusted devices > can DMA pass through IOMMU on a per PASID basis. > Example: > pasid = iommu_dma_pasid_enable(dev, NULL, IOMMU_DMA_PASID_BYPASS); > /* Use the returning PASID and PA for work submission */ > > 2. IOVA mode. DMA API compatible. 
Map a supervisor PASID the same way as the > PCI requester ID (RID) > Example: > pasid = iommu_dma_pasid_enable(dev, NULL, IOMMU_DMA_PASID_IOVA); > /* Use the PASID and DMA API allocated IOVA for work submission */ Hi Jacob, might be stupid question, what is the performance benefit of this IOVA mode comparing with the current dma_map/unmap_single/sg API which have enabled IOMMU like drivers/iommu/arm/arm-smmu-v3? Do we still need to flush IOTLB by sending commands to IOMMU each time while doing dma_unmap? > > 3. KVA mode. New kva map/unmap APIs. Support fast and strict sub-modes > transparently based on device trustfulness. > Example: > pasid = iommu_dma_pasid_enable(dev, &domain, IOMMU_DMA_PASID_KVA); > iommu_map_kva(domain, &buf, size, prot); > /* Use the returned PASID and KVA to submit work */ > Where: > Fast mode: Shared CPU page tables for trusted devices only > Strict mode: IOMMU domain returned for the untrusted device to > replicate KVA-PA mapping in IOMMU page tables. a huge bottleneck of IOMMU we have seen before is that dma_unmap will require IOTLB flush, for example, in arm_smmu_cmdq_issue_cmdlist(), we are having serious contention on acquiring lock and delay on waiting for iotlb flush completion in arm_smmu_cmdq_poll_until_sync() while multi-threads run. I assume KVA mode can avoid this iotlb flush as the device is using the page table of the kernel and sharing the whole kernel space. But will users be glad to accept this mode? It seems users are enduring the performance decrease of IOVA mapping and unmapping because it has better security. dma operations can only run on some specific dma buffers which have been mapped in the current dma-map/unmap with IOMMU backend. 
some drivers are using bouncing buffer to overcome the performance loss of dma_map/unmap as copying is faster than unmapping: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907676b130711fd1f BTW, we have been debugging on dma_map/unmap performance by this benchmark: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/dma/map_benchmark.c you might be able to use it for your benchmarking as well :-) > > On a per device basis, DMA address and performance modes are enabled by the > device drivers. Platform information such as trustability, user command line > input (not included in this set) could also be taken into consideration (not > implemented in this RFC). > > This RFC is intended to communicate the API directions. Little testing is done > outside IDXD and DMA engine tests. > > For PA and IOVA modes, the implementation is straightforward and tested with > Intel IDXD driver. But several opens remain in KVA fast mode thus not tested: > 1. Lack of IOTLB synchronization, kernel direct map alias can be updated as a > result of module loading/eBPF load. Adding kernel mmu notifier? > 2. The use of the auxiliary domain for KVA map, will aux domain stay in the > long term? Is there another way to represent sub-device granu isolation? > 3. Is limiting the KVA sharing to the direct map range reasonable and > practical for all architectures? > > > Many thanks to Ashok Raj, Kevin
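For reference, the benchmark mentioned above is driven from userspace by the selftest binary. A hedged example invocation (the flag names are taken from the selftest source at the time of writing; running it requires a device bound to the dma_map_benchmark driver, so paths and setup vary by platform):

```shell
# Illustrative run of the DMA map benchmark selftest:
#   -t: number of threads doing map/unmap in parallel
#   -s: seconds the test runs
#   -d: DMA direction (0=BIDIRECTIONAL, 1=TO_DEVICE, 2=FROM_DEVICE)
./dma_map_benchmark -t 16 -s 30 -d 0
```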
Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA
On Sat, Oct 2, 2021 at 1:36 AM Jason Gunthorpe wrote:
>
> On Sat, Oct 02, 2021 at 01:24:54AM +1300, Barry Song wrote:
>
> > I assume KVA mode can avoid this iotlb flush as the device is using
> > the page table of the kernel and sharing the whole kernel space. But
> > will users be glad to accept this mode?
>
> You can avoid the lock by identity mapping the physical address space
> of the kernel and making map/unmap a NOP.
>
> KVA is just a different way to achieve this identity map with slightly
> different security properties than the normal way, but it doesn't
> reach the same security level as proper map/unmap.
>
> I'm not sure anyone who cares about DMA security would see value in
> the slight difference between KVA and a normal identity map.

Yes, this is an important question. If users want a high security level, KVA might not be their choice; if users don't want the security, they are already using iommu passthrough. So when will users choose KVA?

>
> > which have been mapped in the current dma-map/unmap with IOMMU backend.
> > some drivers are using bouncing buffer to overcome the performance loss of
> > dma_map/unmap as copying is faster than unmapping:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907676b130711fd1f
>
> It is pretty unfortunate that drivers are hard coding behaviors based
> on assumptions of what the portable API is doing under the covers.

Not really: the driver has a tx_copybreak which can be set by ethtool or similar userspace tools. If users are using iommu passthrough, copying won't happen with the default tx_copybreak. If users are using strict iommu mode, socket buffers are copied into buffers already allocated and mapped in the driver, so the driver doesn't need to map and unmap socket buffers frequently.

> Jason

Thanks
Barry
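The tx_copybreak trade-off described above, copying small packets into buffers that were mapped once at setup instead of mapping and unmapping each packet, can be sketched roughly as follows (hypothetical names; this is not the actual NIC driver code):

```python
# Hypothetical sketch of the tx_copybreak decision a NIC driver makes:
# a small packet is memcpy'd into a bounce buffer that was DMA-mapped
# once at ring setup, so transmitting it needs no per-packet
# dma_map/dma_unmap (and thus no IOTLB flush); a large packet is
# mapped directly and unmapped on completion.

TX_COPYBREAK = 256  # bytes; tunable via ethtool in real drivers

def transmit(packet: bytes, copybreak: int = TX_COPYBREAK) -> str:
    if len(packet) <= copybreak:
        return "copied-to-premapped-buffer"
    return "dma-mapped"
```

The point in the thread is that this copy-vs-map choice only pays off in strict IOMMU mode, where unmapping is expensive; under passthrough the default threshold means no copying happens.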
RE: [PATCH] dma-mapping: benchmark: use the correct HiSilicon copyright
> -Original Message-
> From: fanghao (A)
> Sent: Tuesday, March 30, 2021 7:34 PM
> To: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; Song Bao Hua (Barry Song)
> Cc: iommu@lists.linux-foundation.org; linux...@openeuler.org; linux-kselft...@vger.kernel.org; fanghao (A)
> Subject: [PATCH] dma-mapping: benchmark: use the correct HiSilicon copyright
>
> s/Hisilicon/HiSilicon/g.
> It should use capital S, according to
> https://www.hisilicon.com/en/terms-of-use.

My bad. Thanks.

Acked-by: Barry Song

> Signed-off-by: Hao Fang
> ---
> kernel/dma/map_benchmark.c | 2 +-
> tools/testing/selftests/dma/dma_map_benchmark.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index e0e64f8..00d6549 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -1,6 +1,6 @@
> // SPDX-License-Identifier: GPL-2.0-only
> /*
> - * Copyright (C) 2020 Hisilicon Limited.
> + * Copyright (C) 2020 HiSilicon Limited.
> */
>
> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
> index fb23ce9..b492bed 100644
> --- a/tools/testing/selftests/dma/dma_map_benchmark.c
> +++ b/tools/testing/selftests/dma/dma_map_benchmark.c
> @@ -1,6 +1,6 @@
> // SPDX-License-Identifier: GPL-2.0-only
> /*
> - * Copyright (C) 2020 Hisilicon Limited.
> + * Copyright (C) 2020 HiSilicon Limited.
> */
>
> #include
> --
> 2.8.1
RE: [PATCH] dma-mapping: make map_benchmark compile into module
> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Wednesday, March 24, 2021 8:13 PM
> To: tiantao (H)
> Cc: a...@linux-foundation.org; pet...@infradead.org; paul...@kernel.org; a...@kernel.org; t...@linutronix.de; rost...@goodmis.org; h...@lst.de; m.szyprow...@samsung.com; Song Bao Hua (Barry Song) ; iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] dma-mapping: make map_benchmark compile into module
>
> On Wed, Mar 24, 2021 at 10:17:38AM +0800, Tian Tao wrote:
> > under some scenarios, it is necessary to compile map_benchmark
> > into module to test iommu, so this patch changed Kconfig and
> > export_symbol to implement map_benchmark compiled into module.
> >
> > On the other hand, map_benchmark is a driver, which is supposed
> > to be able to run as a module.
> >
> > Signed-off-by: Tian Tao
>
> Nope, we're not going to export more kthread internals for a test
> module.

The requirement comes from a colleague who frequently changes the map_benchmark code for customized test purposes, and he doesn't want to build a kernel image and reboot every time. So I passed the requirement to Tian Tao. Right now, kthread_bind() is exported; kthread_bind_mask() seems to be a little bit "internal" as you said. Maybe a wrapper like kthread_bind_node() wouldn't be that "internal", compared to exposing the cpumask? Anyway, since we can't find other driver users for this, I can hardly convince you it is worthwhile.

Thanks
Barry
RE: [PATCH] dma-mapping: make map_benchmark compile into module
> -Original Message- > From: tiantao (H) > Sent: Wednesday, March 24, 2021 3:18 PM > To: a...@linux-foundation.org; pet...@infradead.org; paul...@kernel.org; > a...@kernel.org; t...@linutronix.de; rost...@goodmis.org; h...@lst.de; > m.szyprow...@samsung.com; Song Bao Hua (Barry Song) > > Cc: iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org; tiantao > (H) > Subject: [PATCH] dma-mapping: make map_benchmark compile into module > > under some scenarios, it is necessary to compile map_benchmark > into module to test iommu, so this patch changed Kconfig and > export_symbol to implement map_benchmark compiled into module. > > On the other hand, map_benchmark is a driver, which is supposed > to be able to run as a module. > > Signed-off-by: Tian Tao > --- Acked-by: Barry Song Look sensible to me. I like the idea that map_benchmark is a driver. It seems unreasonable to always require built-in. > kernel/dma/Kconfig | 2 +- > kernel/kthread.c | 1 + > 2 files changed, 2 insertions(+), 1 deletion(-) > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig > index 77b4055..0468293 100644 > --- a/kernel/dma/Kconfig > +++ b/kernel/dma/Kconfig > @@ -223,7 +223,7 @@ config DMA_API_DEBUG_SG > If unsure, say N. > > config DMA_MAP_BENCHMARK > - bool "Enable benchmarking of streaming DMA mapping" > + tristate "Enable benchmarking of streaming DMA mapping" > depends on DEBUG_FS > help > Provides /sys/kernel/debug/dma_map_benchmark that helps with testing > diff --git a/kernel/kthread.c b/kernel/kthread.c > index 1578973..fa4736f 100644 > --- a/kernel/kthread.c > +++ b/kernel/kthread.c > @@ -455,6 +455,7 @@ void kthread_bind_mask(struct task_struct *p, const struct > cpumask *mask) > { > __kthread_bind_mask(p, mask, TASK_UNINTERRUPTIBLE); > } > +EXPORT_SYMBOL(kthread_bind_mask); > > /** > * kthread_bind - bind a just-created kthread to a cpu. 
> --
> 2.7.4

Thanks
Barry
RE: [PATCH] dma-mapping: benchmark: Add support for multi-pages map/unmap
> -Original Message-
> From: chenxiang (M)
> Sent: Thursday, March 18, 2021 10:30 PM
> To: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; Song Bao Hua (Barry Song)
> Cc: iommu@lists.linux-foundation.org; linux...@openeuler.org; linux-kselft...@vger.kernel.org; chenxiang (M)
> Subject: [PATCH] dma-mapping: benchmark: Add support for multi-pages map/unmap
>
> From: Xiang Chen
>
> Currently the dma-map benchmark only supports a single-page map/unmap at a
> time, but there are other scenarios which need support for multi-page
> map/unmap: for multi-page interfaces such as dma_alloc_coherent() and
> dma_map_sg(), the time spent on a multi-page map/unmap is not the time of
> a single page * npages (it is not linear), as a block descriptor may be
> used instead of a page descriptor when the size allows (such as 2M/1G),
> and a single TLB invalidation command can invalidate multiple pages
> instead of issuing one command per page when RIL is enabled (which
> shortens the time of unmap). So it is necessary to add support for
> multi-page map/unmap.
>
> Add a parameter "-g" to support multi-page map/unmap.
> > Signed-off-by: Xiang Chen > --- Acked-by: Barry Song > kernel/dma/map_benchmark.c | 21 ++--- > tools/testing/selftests/dma/dma_map_benchmark.c | 20 > 2 files changed, 30 insertions(+), 11 deletions(-) > > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c > index e0e64f8..a5c1b01 100644 > --- a/kernel/dma/map_benchmark.c > +++ b/kernel/dma/map_benchmark.c > @@ -38,7 +38,8 @@ struct map_benchmark { > __u32 dma_bits; /* DMA addressing capability */ > __u32 dma_dir; /* DMA data direction */ > __u32 dma_trans_ns; /* time for DMA transmission in ns */ > - __u8 expansion[80]; /* For future use */ > + __u32 granule; /* how many PAGE_SIZE will do map/unmap once a time */ > + __u8 expansion[76]; /* For future use */ > }; > > struct map_benchmark_data { > @@ -58,9 +59,11 @@ static int map_benchmark_thread(void *data) > void *buf; > dma_addr_t dma_addr; > struct map_benchmark_data *map = data; > + int npages = map->bparam.granule; > + u64 size = npages * PAGE_SIZE; > int ret = 0; > > - buf = (void *)__get_free_page(GFP_KERNEL); > + buf = alloc_pages_exact(size, GFP_KERNEL); > if (!buf) > return -ENOMEM; > > @@ -76,10 +79,10 @@ static int map_benchmark_thread(void *data) >* 66 means evertything goes well! 66 is lucky. 
>*/ > if (map->dir != DMA_FROM_DEVICE) > - memset(buf, 0x66, PAGE_SIZE); > + memset(buf, 0x66, size); > > map_stime = ktime_get(); > - dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, map->dir); > + dma_addr = dma_map_single(map->dev, buf, size, map->dir); > if (unlikely(dma_mapping_error(map->dev, dma_addr))) { > pr_err("dma_map_single failed on %s\n", > dev_name(map->dev)); > @@ -93,7 +96,7 @@ static int map_benchmark_thread(void *data) > ndelay(map->bparam.dma_trans_ns); > > unmap_stime = ktime_get(); > - dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir); > + dma_unmap_single(map->dev, dma_addr, size, map->dir); > unmap_etime = ktime_get(); > unmap_delta = ktime_sub(unmap_etime, unmap_stime); > > @@ -112,7 +115,7 @@ static int map_benchmark_thread(void *data) > } > > out: > - free_page((unsigned long)buf); > + free_pages_exact(buf, size); > return ret; > } > > @@ -203,7 +206,6 @@ static long map_benchmark_ioctl(struct file *file, > unsigned > int cmd, > struct map_benchmark_data *map = file->private_data; > void __user *argp = (void __user *)arg; > u64 old_dma_mask; > - > int ret; > > if (copy_from_user(&map->bparam, argp, sizeof(map->bparam))) > @@ -234,6 +236,11 @@ static long map_benchmark_ioctl(struct file *file, > unsigned > int cmd, > return -EINVAL; > } > > + if (map->bparam.granule < 1 || map->bparam.granule > 1024) { > + pr_err("invalid granule size\n"); > + return -EINVAL; > + } > + > switch (map->bparam.dma_dir) { > case DMA_MAP_BIDIRECTIONAL: >
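The granule bounds check and buffer sizing that the patch adds (reject anything outside 1..1024 pages, then allocate `granule * PAGE_SIZE` bytes) can be mirrored in a quick sketch. PAGE_SIZE is assumed to be 4096 here for illustration; it is architecture-dependent in reality.

```python
PAGE_SIZE = 4096  # assumed for illustration; architecture-dependent

def benchmark_buf_size(granule: int) -> int:
    """Mirror the patch's bounds check: 1 <= granule <= 1024 pages,
    then size the benchmark buffer as granule * PAGE_SIZE."""
    if granule < 1 or granule > 1024:
        raise ValueError("invalid granule size")
    return granule * PAGE_SIZE
```

This is why the patch switches from `__get_free_page()` to `alloc_pages_exact()`: the buffer is no longer a single page but `granule` pages.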
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Thursday, February 11, 2021 7:04 AM
> To: Song Bao Hua (Barry Song)
> Cc: David Hildenbrand ; Wangzhou (B) ; linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org; linux...@kvack.org; linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew Morton ; Alexander Viro ; gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth) ; zhangfei@linaro.org; chensihang (A)
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
>
> On Tue, Feb 09, 2021 at 10:22:47PM +, Song Bao Hua (Barry Song) wrote:
>
> > The problem is that SVA declares we can use any memory of a process
> > to do I/O. And in real scenarios, we are unable to customize most
> > applications to make them use the pool. So we are looking for some
> > extension generically for applications such as Nginx, Ceph.
>
> But those applications will suffer jitter even if they are using the CPU
> to do the same work. I fail to see why adding an accelerator suddenly
> means the application owner will care about jitter introduced by
> migration/etc.

The only point here is that when migration occurs on the accelerator, the impact/jitter is much bigger than it is on the CPU, so the accelerator might become unhelpful.

> Again in proper SVA it should be quite unlikely to take a fault caused
> by something like migration, on the same likelihood as the CPU. If
> things are faulting so much this is a problem then I think it is a
> system level problem with doing too much page motion.

My point is that a single SVA application shouldn't require the system to make global changes, such as disabling NUMA balancing or disabling THP, to decrease page fault frequency, since that affects other applications. Anyway, people are away for the lunar new year; hopefully we will get more real benchmark data afterwards to make the discussion more targeted.
>
> Jason

Thanks
Barry
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
> -Original Message- > From: Jason Gunthorpe [mailto:j...@ziepe.ca] > Sent: Wednesday, February 10, 2021 2:54 AM > To: Song Bao Hua (Barry Song) > Cc: David Hildenbrand ; Wangzhou (B) > ; linux-ker...@vger.kernel.org; > iommu@lists.linux-foundation.org; linux...@kvack.org; > linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew > Morton ; Alexander Viro ; > gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org; > eric.au...@redhat.com; Liguozhu (Kenneth) ; > zhangfei@linaro.org; chensihang (A) > Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory > pin > > On Tue, Feb 09, 2021 at 03:01:42AM +, Song Bao Hua (Barry Song) wrote: > > > On the other hand, wouldn't it be the benefit of hardware accelerators > > to have a lower and more stable latency zip/encryption than CPU? > > No, I don't think so. Fortunately or unfortunately, I think my people have this target to have a lower-latency and more stable zip/encryption by using accelerators, otherwise, they are going to use CPU directly if there is no advantage of accelerators. > > If this is an important problem then it should apply equally to CPU > and IO jitter. > > Honestly I find the idea that occasional migration jitters CPU and DMA > to not be very compelling. Such specialized applications should > allocate special pages to avoid this, not adding an API to be able to > lock down any page That is exactly what we have done to provide a hugeTLB pool so that applications can allocate memory from this pool. +---+ | | |applications using accelerators| +---+ alloc from pool free to pool + ++ | | | | | | | | | | | | | | +--+---+-+ || || | HugeTLB memory pool | || || ++ The problem is that SVA declares we can use any memory of a process to do I/O. And in real scenarios, we are unable to customize most applications to make them use the pool. So we are looking for some extension generically for applications such as Nginx, Ceph. 
I am also thinking about leveraging vm.compact_unevictable_allowed, which David suggested, and extending it, for example, to permit users to disable compaction and NUMA balancing on the unevictable pages of an SVA process, which might be a smaller change. > > Jason Thanks Barry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
> -Original Message- > From: Jason Gunthorpe [mailto:j...@ziepe.ca] > Sent: Tuesday, February 9, 2021 10:30 AM > To: Song Bao Hua (Barry Song) > Cc: David Hildenbrand ; Wangzhou (B) > ; linux-ker...@vger.kernel.org; > iommu@lists.linux-foundation.org; linux...@kvack.org; > linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew > Morton ; Alexander Viro ; > gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org; > eric.au...@redhat.com; Liguozhu (Kenneth) ; > zhangfei@linaro.org; chensihang (A) > Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory > pin > > On Mon, Feb 08, 2021 at 08:35:31PM +, Song Bao Hua (Barry Song) wrote: > > > > > > > From: Jason Gunthorpe [mailto:j...@ziepe.ca] > > > Sent: Tuesday, February 9, 2021 7:34 AM > > > To: David Hildenbrand > > > Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org; > > > iommu@lists.linux-foundation.org; linux...@kvack.org; > > > linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew > > > Morton ; Alexander Viro > ; > > > gre...@linuxfoundation.org; Song Bao Hua (Barry Song) > > > ; kevin.t...@intel.com; > > > jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth) > > > ; zhangfei@linaro.org; chensihang (A) > > > > > > Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide > > > memory > > > pin > > > > > > On Mon, Feb 08, 2021 at 09:14:28AM +0100, David Hildenbrand wrote: > > > > > > > People are constantly struggling with the effects of long term pinnings > > > > under user space control, like we already have with vfio and RDMA. > > > > > > > > And here we are, adding yet another, easier way to mess with core MM in > the > > > > same way. This feels like a step backwards to me. > > > > > > Yes, this seems like a very poor candidate to be a system call in this > > > format. Much too narrow, poorly specified, and possibly security > > > implications to allow any process whatsoever to pin memory. 
> > > > > > I keep encouraging people to explore a standard shared SVA interface > > > that can cover all these topics (and no, uaccel is not that > > > interface), that seems much more natural. > > > > > > I still haven't seen an explanation why DMA is so special here, > > > migration and so forth jitter the CPU too, environments that care > > > about jitter have to turn this stuff off. > > > > This paper has a good explanation: > > https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7482091 > > > > mainly because page fault can go directly to the CPU and we have > > many CPUs. But IO Page Faults go a different way, thus mean much > > higher latency 3-80x slower than page fault: > > events in hardware queue -> Interrupts -> cpu processing page fault > > -> return events to iommu/device -> continue I/O. > > The justifications for this was migration scenarios and migration is > short. If you take a fault on what you are migrating only then does it > slow down the CPU. I agree this can slow down CPU, but not as much as IO page fault. On the other hand, wouldn't it be the benefit of hardware accelerators to have a lower and more stable latency zip/encryption than CPU? > > Are you also working with HW where the IOMMU becomes invalidated after > a migration and doesn't reload? > > ie not true SVA but the sort of emulated SVA we see in a lot of > places? Yes. It is true SVA not emulated SVA. > > It would be much better to work improve that to have closer sync with the > CPU page table than to use pinning. Absolutely I agree improving IOPF and making IOPF catch up with the performance of page fault is the best way. but it would take much long time to optimize both HW and SW. While waiting for them to mature, probably some way which can minimize IOPF should be used to take the responsivity. > > Jason Thanks Barry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
> -Original Message- > From: David Hildenbrand [mailto:da...@redhat.com] > Sent: Monday, February 8, 2021 11:37 PM > To: Song Bao Hua (Barry Song) ; Matthew Wilcox > > Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org; > iommu@lists.linux-foundation.org; linux...@kvack.org; > linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew > Morton ; Alexander Viro ; > gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com; > jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth) > ; zhangfei@linaro.org; chensihang (A) > > Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory > pin > > On 08.02.21 11:13, Song Bao Hua (Barry Song) wrote: > > > > > >> -Original Message- > >> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf > Of > >> David Hildenbrand > >> Sent: Monday, February 8, 2021 9:22 PM > >> To: Song Bao Hua (Barry Song) ; Matthew Wilcox > >> > >> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org; > >> iommu@lists.linux-foundation.org; linux...@kvack.org; > >> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew > >> Morton ; Alexander Viro > ; > >> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com; > >> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth) > >> ; zhangfei@linaro.org; chensihang (A) > >> > >> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory > >> pin > >> > >> On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote: > >>> > >>> > >>>> -Original Message- > >>>> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On > Behalf > >> Of > >>>> Matthew Wilcox > >>>> Sent: Monday, February 8, 2021 2:31 PM > >>>> To: Song Bao Hua (Barry Song) > >>>> Cc: Wangzhou (B) ; > linux-ker...@vger.kernel.org; > >>>> iommu@lists.linux-foundation.org; linux...@kvack.org; > >>>> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew > >>>> Morton ; Alexander Viro > >> ; > >>>> 
gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com; > >>>> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth) > >>>> ; zhangfei@linaro.org; chensihang (A) > >>>> > >>>> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide > >>>> memory > >>>> pin > >>>> > >>>> On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) > >>>> wrote: > >>>>>>> In high-performance I/O cases, accelerators might want to perform > >>>>>>> I/O on a memory without IO page faults which can result in > >>>>>>> dramatically > >>>>>>> increased latency. Current memory related APIs could not achieve this > >>>>>>> requirement, e.g. mlock can only avoid memory to swap to backup > >>>>>>> device, > >>>>>>> page migration can still trigger IO page fault. > >>>>>> > >>>>>> Well ... we have two requirements. The application wants to not take > >>>>>> page faults. The system wants to move the application to a different > >>>>>> NUMA node in order to optimise overall performance. Why should the > >>>>>> application's desires take precedence over the kernel's desires? And > why > >>>>>> should it be done this way rather than by the sysadmin using numactl > to > >>>>>> lock the application to a particular node? > >>>>> > >>>>> NUMA balancer is just one of many reasons for page migration. Even one > >>>>> simple alloc_pages() can cause memory migration in just single NUMA > >>>>> node or UMA system. > >>>>> > >>>>> The other reasons for page migration include but are not limited to: > >>>>> * memory move due to CMA > >>>>> * memory move due to huge pages creation > >>>>> > >>>>> Hardly we can ask users to disable the COMPACTION, CMA and Huge Page > >>>>> in the whole system. > >>>> > >>>> You're dodging the question. Should the CMA allocation fail because > >>>> another appl
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
> -Original Message- > From: Jason Gunthorpe [mailto:j...@ziepe.ca] > Sent: Tuesday, February 9, 2021 7:34 AM > To: David Hildenbrand > Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org; > iommu@lists.linux-foundation.org; linux...@kvack.org; > linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew > Morton ; Alexander Viro ; > gre...@linuxfoundation.org; Song Bao Hua (Barry Song) > ; kevin.t...@intel.com; > jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth) > ; zhangfei@linaro.org; chensihang (A) > > Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory > pin > > On Mon, Feb 08, 2021 at 09:14:28AM +0100, David Hildenbrand wrote: > > > People are constantly struggling with the effects of long term pinnings > > under user space control, like we already have with vfio and RDMA. > > > > And here we are, adding yet another, easier way to mess with core MM in the > > same way. This feels like a step backwards to me. > > Yes, this seems like a very poor candidate to be a system call in this > format. Much too narrow, poorly specified, and possibly security > implications to allow any process whatsoever to pin memory. > > I keep encouraging people to explore a standard shared SVA interface > that can cover all these topics (and no, uaccel is not that > interface), that seems much more natural. > > I still haven't seen an explanation why DMA is so special here, > migration and so forth jitter the CPU too, environments that care > about jitter have to turn this stuff off. This paper has a good explanation: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7482091 mainly because page fault can go directly to the CPU and we have many CPUs. But IO Page Faults go a different way, thus mean much higher latency 3-80x slower than page fault: events in hardware queue -> Interrupts -> cpu processing page fault -> return events to iommu/device -> continue I/O. 
Copied from the paper: If the IOMMU's page table walker fails to find the desired translation in the page table, it sends an ATS response to the GPU notifying it of this failure. This in turn corresponds to a page fault. In response, the GPU sends another request to the IOMMU called a Peripheral Page Request (PPR). The IOMMU places this request in a memory-mapped queue and raises an interrupt on the CPU. Multiple PPR requests can be queued before the CPU is interrupted. The OS must have a suitable IOMMU driver to process this interrupt and the queued PPR requests. In Linux, while in an interrupt context, the driver pulls PPR requests from the queue and places them in a work-queue for later processing. Presumably this design decision was made to minimize the time spent executing in an interrupt context, where lower priority interrupts would be disabled. At a later time, an OS worker-thread calls back into the driver to process page fault requests in the work-queue. Once the requests are serviced, the driver notifies the IOMMU. In turn, the IOMMU notifies the GPU. The GPU then sends another ATS request to retry the translation for the original faulting address. Comparison with CPU: On the CPU, a hardware exception is raised on a page fault, which immediately switches to the OS. In most cases in Linux, this routine services the page fault directly, instead of queuing it for later processing. Contrast this with a page fault from an accelerator, where the IOMMU has to interrupt the CPU to request service on its behalf, and also note the several back-and-forth messages between the accelerator, the IOMMU, and the CPU. Furthermore, page faults on the CPU are generally handled one at a time on the CPU, while for the GPU they are batched by the IOMMU and OS work-queue mechanism. > > Jason Thanks Barry
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
> -Original Message- > From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of > David Hildenbrand > Sent: Monday, February 8, 2021 9:22 PM > To: Song Bao Hua (Barry Song) ; Matthew Wilcox > > Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org; > iommu@lists.linux-foundation.org; linux...@kvack.org; > linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew > Morton ; Alexander Viro ; > gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com; > jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth) > ; zhangfei@linaro.org; chensihang (A) > > Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory > pin > > On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote: > > > > > >> -Original Message- > >> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf > Of > >> Matthew Wilcox > >> Sent: Monday, February 8, 2021 2:31 PM > >> To: Song Bao Hua (Barry Song) > >> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org; > >> iommu@lists.linux-foundation.org; linux...@kvack.org; > >> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew > >> Morton ; Alexander Viro > ; > >> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com; > >> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth) > >> ; zhangfei@linaro.org; chensihang (A) > >> > >> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory > >> pin > >> > >> On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) wrote: > >>>>> In high-performance I/O cases, accelerators might want to perform > >>>>> I/O on a memory without IO page faults which can result in dramatically > >>>>> increased latency. Current memory related APIs could not achieve this > >>>>> requirement, e.g. mlock can only avoid memory to swap to backup device, > >>>>> page migration can still trigger IO page fault. > >>>> > >>>> Well ... we have two requirements. 
The application wants to not take > >>>> page faults. The system wants to move the application to a different > >>>> NUMA node in order to optimise overall performance. Why should the > >>>> application's desires take precedence over the kernel's desires? And why > >>>> should it be done this way rather than by the sysadmin using numactl to > >>>> lock the application to a particular node? > >>> > >>> NUMA balancer is just one of many reasons for page migration. Even one > >>> simple alloc_pages() can cause memory migration in just single NUMA > >>> node or UMA system. > >>> > >>> The other reasons for page migration include but are not limited to: > >>> * memory move due to CMA > >>> * memory move due to huge pages creation > >>> > >>> Hardly we can ask users to disable the COMPACTION, CMA and Huge Page > >>> in the whole system. > >> > >> You're dodging the question. Should the CMA allocation fail because > >> another application is using SVA? > >> > >> I would say no. > > > > I would say no as well. > > > > While IOMMU is enabled, CMA almost has one user only: IOMMU driver > > as other drivers will depend on iommu to use non-contiguous memory > > though they are still calling dma_alloc_coherent(). > > > > In iommu driver, dma_alloc_coherent is called during initialization > > and there is no new allocation afterwards. So it wouldn't cause > > runtime impact on SVA performance. Even there is new allocations, > > CMA will fall back to general alloc_pages() and iommu drivers are > > almost allocating small memory for command queues. > > > > So I would say general compound pages, huge pages, especially > > transparent huge pages, would be bigger concerns than CMA for > > internal page migration within one NUMA. > > > > Not like CMA, general alloc_pages() can get memory by moving > > pages other than those pinned. > > > > And there is no guarantee we can always bind the memory of > > SVA applications to single one NUMA, so NUMA balancing is > > still a concern. 
> > > > But I agree we need a way to make CMA success while the userspace > > pages are pinned. Since pin has been viral in many drivers, I > > assume there is a way to handle this. Otherwise, APIs like > > V4L2_ME
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
> -Original Message- > From: David Rientjes [mailto:rient...@google.com] > Sent: Monday, February 8, 2021 3:18 PM > To: Song Bao Hua (Barry Song) > Cc: Matthew Wilcox ; Wangzhou (B) > ; linux-ker...@vger.kernel.org; > iommu@lists.linux-foundation.org; linux...@kvack.org; > linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew > Morton ; Alexander Viro ; > gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com; > jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth) > ; zhangfei@linaro.org; chensihang (A) > > Subject: RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory > pin > > On Sun, 7 Feb 2021, Song Bao Hua (Barry Song) wrote: > > > NUMA balancer is just one of many reasons for page migration. Even one > > simple alloc_pages() can cause memory migration in just single NUMA > > node or UMA system. > > > > The other reasons for page migration include but are not limited to: > > * memory move due to CMA > > * memory move due to huge pages creation > > > > Hardly we can ask users to disable the COMPACTION, CMA and Huge Page > > in the whole system. > > > > What about only for mlocked memory, i.e. disable > vm.compact_unevictable_allowed? > > Adding syscalls is a big deal, we can make a reasonable inference that > we'll have to support this forever if it's merged. I haven't seen mention > of what other unevictable memory *should* be migratable that would be > adversely affected if we disable that sysctl. Maybe that gets you part of > the way there and there are some other deficiencies, but it seems like a > good start would be to describe how CONFIG_NUMA_BALANCING=n + > vm.compact_unevcitable_allowed + mlock() doesn't get you mostly there and > then look into what's missing. > I believe it can resolve the performance problem for the SVA applications if we disable vm.compact_unevcitable_allowed and NUMA_BALANCE, and use mlock(). 
The problem is that it is unreasonable to ask users to disable vm.compact_unevictable_allowed or NUMA balancing for the whole system only because there is one SVA application in the system. SVA, in itself, is a mechanism to let the CPU and devices share the same address space. In a typical server system there are many processes, so the better way would be to change only the behavior of the specific process rather than the whole system. It is hard to ask users to do that just because there is an SVA monster. Plus, this might negatively affect those applications not using SVA. > If it's a very compelling case where there simply are no alternatives, it > would make sense. Alternative is to find a more generic way, perhaps in > combination with vm.compact_unevictable_allowed, to achieve what you're > looking to do that can be useful even beyond your originally intended use > case. Sensible. Actually, pin is exactly the way to disable migration for specific pages, i.e. disabling "vm.compact_unevictable_allowed" on those pages only. It is hard to differentiate which pages should not be migrated; only the applications know, as even SVA applications can allocate many non-I/O pages which should remain movable. Thanks Barry
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
> -Original Message- > From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of > Matthew Wilcox > Sent: Monday, February 8, 2021 2:31 PM > To: Song Bao Hua (Barry Song) > Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org; > iommu@lists.linux-foundation.org; linux...@kvack.org; > linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew > Morton ; Alexander Viro ; > gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com; > jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth) > ; zhangfei@linaro.org; chensihang (A) > > Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory > pin > > On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) wrote: > > > > In high-performance I/O cases, accelerators might want to perform > > > > I/O on a memory without IO page faults which can result in dramatically > > > > increased latency. Current memory related APIs could not achieve this > > > > requirement, e.g. mlock can only avoid memory to swap to backup device, > > > > page migration can still trigger IO page fault. > > > > > > Well ... we have two requirements. The application wants to not take > > > page faults. The system wants to move the application to a different > > > NUMA node in order to optimise overall performance. Why should the > > > application's desires take precedence over the kernel's desires? And why > > > should it be done this way rather than by the sysadmin using numactl to > > > lock the application to a particular node? > > > > NUMA balancer is just one of many reasons for page migration. Even one > > simple alloc_pages() can cause memory migration in just single NUMA > > node or UMA system. > > > > The other reasons for page migration include but are not limited to: > > * memory move due to CMA > > * memory move due to huge pages creation > > > > Hardly we can ask users to disable the COMPACTION, CMA and Huge Page > > in the whole system. 
> > You're dodging the question. Should the CMA allocation fail because > another application is using SVA? > > I would say no. I would say no as well. While IOMMU is enabled, CMA almost has one user only: IOMMU driver as other drivers will depend on iommu to use non-contiguous memory though they are still calling dma_alloc_coherent(). In iommu driver, dma_alloc_coherent is called during initialization and there is no new allocation afterwards. So it wouldn't cause runtime impact on SVA performance. Even there is new allocations, CMA will fall back to general alloc_pages() and iommu drivers are almost allocating small memory for command queues. So I would say general compound pages, huge pages, especially transparent huge pages, would be bigger concerns than CMA for internal page migration within one NUMA. Not like CMA, general alloc_pages() can get memory by moving pages other than those pinned. And there is no guarantee we can always bind the memory of SVA applications to single one NUMA, so NUMA balancing is still a concern. But I agree we need a way to make CMA success while the userspace pages are pinned. Since pin has been viral in many drivers, I assume there is a way to handle this. Otherwise, APIs like V4L2_MEMORY_USERPTR[1] will possibly make CMA fail as there is no guarantee that usersspace will allocate unmovable memory and there is no guarantee the fallback path- alloc_pages() can succeed while allocating big memory. Will investigate more. > The application using SVA should take the one-time > performance hit from having its memory moved around. Sometimes I also feel SVA is doomed to suffer from performance impact due to page migration. But we are still trying to extend its use cases to high-performance I/O. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/media/v4l2-core/videobuf-dma-sg.c Thanks Barry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
> -Original Message- > From: Matthew Wilcox [mailto:wi...@infradead.org] > Sent: Monday, February 8, 2021 10:34 AM > To: Wangzhou (B) > Cc: linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org; > linux...@kvack.org; linux-arm-ker...@lists.infradead.org; > linux-...@vger.kernel.org; Andrew Morton ; > Alexander Viro ; gre...@linuxfoundation.org; Song > Bao Hua (Barry Song) ; j...@ziepe.ca; > kevin.t...@intel.com; jean-phili...@linaro.org; eric.au...@redhat.com; > Liguozhu (Kenneth) ; zhangfei@linaro.org; > chensihang (A) > Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory > pin > > On Sun, Feb 07, 2021 at 04:18:03PM +0800, Zhou Wang wrote: > > SVA(share virtual address) offers a way for device to share process virtual > > address space safely, which makes more convenient for user space device > > driver coding. However, IO page faults may happen when doing DMA > > operations. As the latency of IO page fault is relatively big, DMA > > performance will be affected severely when there are IO page faults. > > >From a long term view, DMA performance will be not stable. > > > > In high-performance I/O cases, accelerators might want to perform > > I/O on a memory without IO page faults which can result in dramatically > > increased latency. Current memory related APIs could not achieve this > > requirement, e.g. mlock can only avoid memory to swap to backup device, > > page migration can still trigger IO page fault. > > Well ... we have two requirements. The application wants to not take > page faults. The system wants to move the application to a different > NUMA node in order to optimise overall performance. Why should the > application's desires take precedence over the kernel's desires? And why > should it be done this way rather than by the sysadmin using numactl to > lock the application to a particular node? NUMA balancer is just one of many reasons for page migration. 
Even one simple alloc_pages() can cause memory migration within a single NUMA node or on a UMA system. The other reasons for page migration include but are not limited to: * memory move due to CMA * memory move due to huge page creation We can hardly ask users to disable COMPACTION, CMA and huge pages in the whole system. On the other hand, numactl doesn't always bind memory to a single NUMA node; sometimes, when applications require many CPUs, it can bind more than one memory node. Thanks Barry
[PATCH v3 2/2] dma-mapping: benchmark: pretend DMA is transmitting
In a real dma mapping user case, after dma_map is done, data will be transmit. Thus, in multi-threaded user scenario, IOMMU contention should not be that severe. For example, if users enable multiple threads to send network packets through 1G/10G/100Gbps NIC, usually the steps will be: map -> transmission -> unmap. Transmission delay reduces the contention of IOMMU. Here a delay is added to simulate the transmission between map and unmap so that the tested result could be more accurate for TX and simple RX. A typical TX transmission for NIC would be like: map -> TX -> unmap since the socket buffers come from OS. Simple RX model eg. disk driver, is also map -> RX -> unmap, but real RX model in a NIC could be more complicated considering packets can come spontaneously and many drivers are using pre-mapped buffers pool. This is in the TBD list. Signed-off-by: Barry Song --- kernel/dma/map_benchmark.c| 12 ++- .../testing/selftests/dma/dma_map_benchmark.c | 21 --- 2 files changed, 29 insertions(+), 4 deletions(-) diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c index da95df381483..e0e64f8b0739 100644 --- a/kernel/dma/map_benchmark.c +++ b/kernel/dma/map_benchmark.c @@ -21,6 +21,7 @@ #define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark) #define DMA_MAP_MAX_THREADS1024 #define DMA_MAP_MAX_SECONDS300 +#define DMA_MAP_MAX_TRANS_DELAY(10 * NSEC_PER_MSEC) #define DMA_MAP_BIDIRECTIONAL 0 #define DMA_MAP_TO_DEVICE 1 @@ -36,7 +37,8 @@ struct map_benchmark { __s32 node; /* which numa node this benchmark will run on */ __u32 dma_bits; /* DMA addressing capability */ __u32 dma_dir; /* DMA data direction */ - __u8 expansion[84]; /* For future use */ + __u32 dma_trans_ns; /* time for DMA transmission in ns */ + __u8 expansion[80]; /* For future use */ }; struct map_benchmark_data { @@ -87,6 +89,9 @@ static int map_benchmark_thread(void *data) map_etime = ktime_get(); map_delta = ktime_sub(map_etime, map_stime); + /* Pretend DMA is transmitting */ + 
ndelay(map->bparam.dma_trans_ns); + unmap_stime = ktime_get(); dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir); unmap_etime = ktime_get(); @@ -218,6 +223,11 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd, return -EINVAL; } + if (map->bparam.dma_trans_ns > DMA_MAP_MAX_TRANS_DELAY) { + pr_err("invalid transmission delay\n"); + return -EINVAL; + } + if (map->bparam.node != NUMA_NO_NODE && !node_possible(map->bparam.node)) { pr_err("invalid numa node\n"); diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c index 537d65968c48..fb23ce9617ea 100644 --- a/tools/testing/selftests/dma/dma_map_benchmark.c +++ b/tools/testing/selftests/dma/dma_map_benchmark.c @@ -12,9 +12,12 @@ #include #include +#define NSEC_PER_MSEC 100L + #define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark) #define DMA_MAP_MAX_THREADS1024 #define DMA_MAP_MAX_SECONDS 300 +#define DMA_MAP_MAX_TRANS_DELAY(10 * NSEC_PER_MSEC) #define DMA_MAP_BIDIRECTIONAL 0 #define DMA_MAP_TO_DEVICE 1 @@ -36,7 +39,8 @@ struct map_benchmark { __s32 node; /* which numa node this benchmark will run on */ __u32 dma_bits; /* DMA addressing capability */ __u32 dma_dir; /* DMA data direction */ - __u8 expansion[84]; /* For future use */ + __u32 dma_trans_ns; /* time for DMA transmission in ns */ + __u8 expansion[80]; /* For future use */ }; int main(int argc, char **argv) @@ -46,12 +50,12 @@ int main(int argc, char **argv) /* default single thread, run 20 seconds on NUMA_NO_NODE */ int threads = 1, seconds = 20, node = -1; /* default dma mask 32bit, bidirectional DMA */ - int bits = 32, dir = DMA_MAP_BIDIRECTIONAL; + int bits = 32, xdelay = 0, dir = DMA_MAP_BIDIRECTIONAL; int cmd = DMA_MAP_BENCHMARK; char *p; - while ((opt = getopt(argc, argv, "t:s:n:b:d:")) != -1) { + while ((opt = getopt(argc, argv, "t:s:n:b:d:x:")) != -1) { switch (opt) { case 't': threads = atoi(optarg); @@ -68,6 +72,9 @@ int main(int argc, char **argv) case 'd': 
dir = atoi(optarg); break; + case 'x': + xdelay = atoi(optarg); + break;
[PATCH v3 1/2] dma-mapping: benchmark: use u8 for reserved field in uAPI structure
The original code put five u32 before a u64 expansion[10] array. Five is odd, this will cause trouble in the extension of the structure by adding new features. This patch moves to use u8 for reserved field to avoid future alignment risk. Meanwhile, it also clears the memory of struct map_benchmark in tools, otherwise, if users use old version to run on newer kernel, the random expansion value will cause side effect on newer kernel. Signed-off-by: Barry Song --- kernel/dma/map_benchmark.c | 2 +- tools/testing/selftests/dma/dma_map_benchmark.c | 4 +++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c index 1b1b8ff875cb..da95df381483 100644 --- a/kernel/dma/map_benchmark.c +++ b/kernel/dma/map_benchmark.c @@ -36,7 +36,7 @@ struct map_benchmark { __s32 node; /* which numa node this benchmark will run on */ __u32 dma_bits; /* DMA addressing capability */ __u32 dma_dir; /* DMA data direction */ - __u64 expansion[10];/* For future use */ + __u8 expansion[84]; /* For future use */ }; struct map_benchmark_data { diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c index 7065163a8388..537d65968c48 100644 --- a/tools/testing/selftests/dma/dma_map_benchmark.c +++ b/tools/testing/selftests/dma/dma_map_benchmark.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -35,7 +36,7 @@ struct map_benchmark { __s32 node; /* which numa node this benchmark will run on */ __u32 dma_bits; /* DMA addressing capability */ __u32 dma_dir; /* DMA data direction */ - __u64 expansion[10];/* For future use */ + __u8 expansion[84]; /* For future use */ }; int main(int argc, char **argv) @@ -102,6 +103,7 @@ int main(int argc, char **argv) exit(1); } + memset(&map, 0, sizeof(map)); map.seconds = seconds; map.threads = threads; map.node = node; -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org 
https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting
> -Original Message- > From: Christoph Hellwig [mailto:h...@lst.de] > Sent: Friday, February 5, 2021 11:36 PM > To: Song Bao Hua (Barry Song) > Cc: Christoph Hellwig ; m.szyprow...@samsung.com; > robin.mur...@arm.com; iommu@lists.linux-foundation.org; > linux-ker...@vger.kernel.org; linux...@openeuler.org > Subject: Re: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting > > On Fri, Feb 05, 2021 at 10:32:26AM +, Song Bao Hua (Barry Song) wrote: > > I can keep the struct size unchanged by changing the struct to > > > > struct map_benchmark { > > __u64 avg_map_100ns; /* average map latency in 100ns */ > > __u64 map_stddev; /* standard deviation of map latency */ > > __u64 avg_unmap_100ns; /* as above */ > > __u64 unmap_stddev; > > __u32 threads; /* how many threads will do map/unmap in parallel */ > > __u32 seconds; /* how long the test will last */ > > __s32 node; /* which numa node this benchmark will run on */ > > __u32 dma_bits; /* DMA addressing capability */ > > __u32 dma_dir; /* DMA data direction */ > > __u32 dma_trans_ns; /* time for DMA transmission in ns */ > > > > __u32 exp; /* For future use */ > > __u64 expansion[9]; /* For future use */ > > }; > > > > But the code is really ugly now. > > Thats why we usually use __u8 fields for reserved field. You might > consider just switching to that instead while you're at it. I guess > we'll just have to get the addition into 5.11 then to make sure we > don't release a kernel with the alignment fix. 
I assume there is no need to keep the same size as 5.11-rc, so we could change the struct to:

struct map_benchmark {
	__u64 avg_map_100ns;	/* average map latency in 100ns */
	__u64 map_stddev;	/* standard deviation of map latency */
	__u64 avg_unmap_100ns;	/* as above */
	__u64 unmap_stddev;
	__u32 threads;	/* how many threads will do map/unmap in parallel */
	__u32 seconds;	/* how long the test will last */
	__s32 node;	/* which numa node this benchmark will run on */
	__u32 dma_bits;	/* DMA addressing capability */
	__u32 dma_dir;	/* DMA data direction */
	__u8 expansion[84];	/* For future use */
};

This won't increase the size on 64-bit systems, but it does add 4 bytes on 32-bit systems compared to 5.11-rc. What do you think?

Thanks
Barry
RE: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting
> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Friday, February 5, 2021 10:21 PM
> To: Song Bao Hua (Barry Song)
> Cc: m.szyprow...@samsung.com; h...@lst.de; robin.mur...@arm.com;
> iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting
>
> On Fri, Feb 05, 2021 at 03:00:35PM +1300, Barry Song wrote:
> > +	__u32 dma_trans_ns; /* time for DMA transmission in ns */
> >  	__u64 expansion[10];/* For future use */
>
> We need to keep the struct size, so the expansion field needs to
> shrink by the equivalent amount of data that is added in dma_trans_ns.

Unfortunately I didn't put a reserved __u32 field after dma_dir in the original patch. There were five 32-bit fields before expansion[]:

struct map_benchmark {
	__u64 avg_map_100ns;	/* average map latency in 100ns */
	__u64 map_stddev;	/* standard deviation of map latency */
	__u64 avg_unmap_100ns;	/* as above */
	__u64 unmap_stddev;
	__u32 threads;	/* how many threads will do map/unmap in parallel */
	__u32 seconds;	/* how long the test will last */
	__s32 node;	/* which numa node this benchmark will run on */
	__u32 dma_bits;	/* DMA addressing capability */
	__u32 dma_dir;	/* DMA data direction */
	__u64 expansion[10];	/* For future use */
};

My bad. That was really silly. I should have done the below from the very beginning:

struct map_benchmark {
	__u64 avg_map_100ns;	/* average map latency in 100ns */
	__u64 map_stddev;	/* standard deviation of map latency */
	__u64 avg_unmap_100ns;	/* as above */
	__u64 unmap_stddev;
	__u32 threads;	/* how many threads will do map/unmap in parallel */
	__u32 seconds;	/* how long the test will last */
	__s32 node;	/* which numa node this benchmark will run on */
	__u32 dma_bits;	/* DMA addressing capability */
	__u32 dma_dir;	/* DMA data direction */
	__u32 rsv;
	__u64 expansion[10];	/* For future use */
};

So on a 64-bit system, this patch doesn't change the length of the struct, as the newly added __u32 just fills the gap between dma_dir and expansion. For a 32-bit system, this patch increases the length by 4 bytes.

I can keep the struct size unchanged by changing the struct to:

struct map_benchmark {
	__u64 avg_map_100ns;	/* average map latency in 100ns */
	__u64 map_stddev;	/* standard deviation of map latency */
	__u64 avg_unmap_100ns;	/* as above */
	__u64 unmap_stddev;
	__u32 threads;	/* how many threads will do map/unmap in parallel */
	__u32 seconds;	/* how long the test will last */
	__s32 node;	/* which numa node this benchmark will run on */
	__u32 dma_bits;	/* DMA addressing capability */
	__u32 dma_dir;	/* DMA data direction */
	__u32 dma_trans_ns;	/* time for DMA transmission in ns */
	__u32 exp;	/* For future use */
	__u64 expansion[9];	/* For future use */
};

But the code is really ugly now.

Thanks
Barry
[PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting
In a real dma mapping use case, after dma_map is done, data will be transmitted. Thus, in a multi-threaded user scenario, IOMMU contention should not be that severe. For example, if users enable multiple threads to send network packets through a 1G/10G/100Gbps NIC, the steps will usually be: map -> transmission -> unmap. Transmission delay reduces the contention on the IOMMU.

Here a delay is added to simulate the transmission between map and unmap so that the tested result can be more accurate for TX and simple RX. A typical TX transmission for a NIC would be: map -> TX -> unmap, since the socket buffers come from the OS. A simple RX model, e.g. a disk driver, is also map -> RX -> unmap, but a real RX model in a NIC could be more complicated, considering packets can arrive spontaneously and many drivers use pre-mapped buffer pools. This is in the TBD list.

Signed-off-by: Barry Song
---
-v2: cleanup according to Robin's feedback. thanks, Robin.

 kernel/dma/map_benchmark.c                      | 10 ++
 tools/testing/selftests/dma/dma_map_benchmark.c | 19 +--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index 1b1b8ff875cb..06636406a245 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -21,6 +21,7 @@
 #define DMA_MAP_BENCHMARK	_IOWR('d', 1, struct map_benchmark)
 #define DMA_MAP_MAX_THREADS	1024
 #define DMA_MAP_MAX_SECONDS	300
+#define DMA_MAP_MAX_TRANS_DELAY	(10 * NSEC_PER_MSEC) /* 10ms */

 #define DMA_MAP_BIDIRECTIONAL	0
 #define DMA_MAP_TO_DEVICE	1
@@ -36,6 +37,7 @@ struct map_benchmark {
 	__s32 node; /* which numa node this benchmark will run on */
 	__u32 dma_bits; /* DMA addressing capability */
 	__u32 dma_dir; /* DMA data direction */
+	__u32 dma_trans_ns; /* time for DMA transmission in ns */
 	__u64 expansion[10];/* For future use */
 };

@@ -87,6 +89,9 @@ static int map_benchmark_thread(void *data)
 		map_etime = ktime_get();
 		map_delta = ktime_sub(map_etime, map_stime);

+		/* Pretend DMA is transmitting */
+		ndelay(map->bparam.dma_trans_ns);
+
 		unmap_stime = ktime_get();
 		dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);
 		unmap_etime = ktime_get();
@@ -218,6 +223,11 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
 		return -EINVAL;
 	}

+	if (map->bparam.dma_trans_ns > DMA_MAP_MAX_TRANS_DELAY) {
+		pr_err("invalid transmission delay\n");
+		return -EINVAL;
+	}
+
 	if (map->bparam.node != NUMA_NO_NODE &&
 	    !node_possible(map->bparam.node)) {
 		pr_err("invalid numa node\n");
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
index 7065163a8388..a370290d9503 100644
--- a/tools/testing/selftests/dma/dma_map_benchmark.c
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -11,9 +11,12 @@
 #include
 #include

+#define NSEC_PER_MSEC	1000000L
+
 #define DMA_MAP_BENCHMARK	_IOWR('d', 1, struct map_benchmark)
 #define DMA_MAP_MAX_THREADS	1024
 #define DMA_MAP_MAX_SECONDS	300
+#define DMA_MAP_MAX_TRANS_DELAY	(10 * NSEC_PER_MSEC) /* 10ms */

 #define DMA_MAP_BIDIRECTIONAL	0
 #define DMA_MAP_TO_DEVICE	1
@@ -35,6 +38,7 @@ struct map_benchmark {
 	__s32 node; /* which numa node this benchmark will run on */
 	__u32 dma_bits; /* DMA addressing capability */
 	__u32 dma_dir; /* DMA data direction */
+	__u32 dma_trans_ns; /* delay for DMA transmission in ns */
 	__u64 expansion[10];/* For future use */
 };

@@ -45,12 +49,12 @@ int main(int argc, char **argv)
 	/* default single thread, run 20 seconds on NUMA_NO_NODE */
 	int threads = 1, seconds = 20, node = -1;
 	/* default dma mask 32bit, bidirectional DMA */
-	int bits = 32, dir = DMA_MAP_BIDIRECTIONAL;
+	int bits = 32, xdelay = 0, dir = DMA_MAP_BIDIRECTIONAL;
 	int cmd = DMA_MAP_BENCHMARK;
 	char *p;

-	while ((opt = getopt(argc, argv, "t:s:n:b:d:")) != -1) {
+	while ((opt = getopt(argc, argv, "t:s:n:b:d:x:")) != -1) {
 		switch (opt) {
 		case 't':
 			threads = atoi(optarg);
@@ -67,6 +71,9 @@ int main(int argc, char **argv)
 		case 'd':
 			dir = atoi(optarg);
 			break;
+		case 'x':
+			xdelay = atoi(optarg);
+			break;
 		default:
 			return -1;
 		}
@@ -
RE: [PATCH] dma-mapping: benchmark: pretend DMA is transmitting
> -Original Message- > From: Robin Murphy [mailto:robin.mur...@arm.com] > Sent: Friday, February 5, 2021 12:51 PM > To: Song Bao Hua (Barry Song) ; > m.szyprow...@samsung.com; h...@lst.de; iommu@lists.linux-foundation.org > Cc: linux-ker...@vger.kernel.org; linux...@openeuler.org > Subject: Re: [PATCH] dma-mapping: benchmark: pretend DMA is transmitting > > On 2021-02-04 22:58, Barry Song wrote: > > In a real dma mapping user case, after dma_map is done, data will be > > transmit. Thus, in multi-threaded user scenario, IOMMU contention > > should not be that severe. For example, if users enable multiple > > threads to send network packets through 1G/10G/100Gbps NIC, usually > > the steps will be: map -> transmission -> unmap. Transmission delay > > reduces the contention of IOMMU. Here a delay is added to simulate > > the transmission for TX case so that the tested result could be > > more accurate. > > > > RX case would be much more tricky. It is not supported yet. > > I guess it might be a reasonable approximation to map several pages, > then unmap them again after a slightly more random delay. Or maybe > divide the threads into pairs of mappers and unmappers respectively > filling up and draining proper little buffer pools. Yes. Good suggestions. I am actually thinking about how to support cases like networks. There is a pre-mapped list of pages, each page is bound with some hardware DMA block descriptor(BD). So if Linux can consume the packets in time, those buffers are always re-used. Only when the page bound with BD is full and OS can't consume it in time, another temp page will be allocated and mapped, BD will switch to use this temp page, then finally unmap it if it is not needed any more. On the other hand, the pre-mapped pages are never unmapped. For things like filesystem and disk driver, RX is always requested by users. The model would be simpler: map -> rx -> unmap. For networks, RX transmission can come spontaneously. Anyway, I'll put this into TBD. 
For this moment, mainly handle TX path. Or maybe the current code has been able to handle simple RX model :-) > > > Signed-off-by: Barry Song > > --- > > kernel/dma/map_benchmark.c | 11 +++ > > tools/testing/selftests/dma/dma_map_benchmark.c | 17 +++-- > > 2 files changed, 26 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c > > index 1b1b8ff875cb..1976db7e34e4 100644 > > --- a/kernel/dma/map_benchmark.c > > +++ b/kernel/dma/map_benchmark.c > > @@ -21,6 +21,7 @@ > > #define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark) > > #define DMA_MAP_MAX_THREADS 1024 > > #define DMA_MAP_MAX_SECONDS 300 > > +#define DMA_MAP_MAX_TRANS_DELAY(10 * 1000 * 1000) /* 10ms */ > > Using MSEC_PER_SEC might be sufficiently self-documenting? Yes, I guess you mean NSEC_PER_MSEC. will move to it. > > > #define DMA_MAP_BIDIRECTIONAL 0 > > #define DMA_MAP_TO_DEVICE 1 > > @@ -36,6 +37,7 @@ struct map_benchmark { > > __s32 node; /* which numa node this benchmark will run on */ > > __u32 dma_bits; /* DMA addressing capability */ > > __u32 dma_dir; /* DMA data direction */ > > + __u32 dma_trans_ns; /* time for DMA transmission in ns */ > > __u64 expansion[10];/* For future use */ > > }; > > > > @@ -87,6 +89,10 @@ static int map_benchmark_thread(void *data) > > map_etime = ktime_get(); > > map_delta = ktime_sub(map_etime, map_stime); > > > > + /* Pretend DMA is transmitting */ > > + if (map->dir != DMA_FROM_DEVICE) > > + ndelay(map->bparam.dma_trans_ns); > > TBH I think the option of a fixed delay between map and unmap might be a > handy thing in general, so having the direction check at all seems > needlessly restrictive. As long as the driver implements all the basic > building blocks, combining them to simulate specific traffic patterns > can be left up to the benchmark tool. Sensible, will remove the condition check. > > Robin. 
> > > + > > unmap_stime = ktime_get(); > > dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir); > > unmap_etime = ktime_get(); > > @@ -218,6 +224,11 @@ static long map_benchmark_ioctl(struct file *file, > unsigned int cmd, > > return -EINVAL; > > } > > > > + if (map->bparam.dma_trans_ns > DMA_MAP_MAX_TRANS_DELAY) { > > + pr_err("invalid tr
[PATCH] dma-mapping: benchmark: pretend DMA is transmitting
In a real dma mapping use case, after dma_map is done, data will be transmitted. Thus, in a multi-threaded user scenario, IOMMU contention should not be that severe. For example, if users enable multiple threads to send network packets through a 1G/10G/100Gbps NIC, the steps will usually be: map -> transmission -> unmap. Transmission delay reduces the contention on the IOMMU. Here a delay is added to simulate the transmission for the TX case so that the tested result can be more accurate.

The RX case would be much more tricky. It is not supported yet.

Signed-off-by: Barry Song
---
 kernel/dma/map_benchmark.c                      | 11 +++
 tools/testing/selftests/dma/dma_map_benchmark.c | 17 +++--
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index 1b1b8ff875cb..1976db7e34e4 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -21,6 +21,7 @@
 #define DMA_MAP_BENCHMARK	_IOWR('d', 1, struct map_benchmark)
 #define DMA_MAP_MAX_THREADS	1024
 #define DMA_MAP_MAX_SECONDS	300
+#define DMA_MAP_MAX_TRANS_DELAY	(10 * 1000 * 1000) /* 10ms */

 #define DMA_MAP_BIDIRECTIONAL	0
 #define DMA_MAP_TO_DEVICE	1
@@ -36,6 +37,7 @@ struct map_benchmark {
 	__s32 node; /* which numa node this benchmark will run on */
 	__u32 dma_bits; /* DMA addressing capability */
 	__u32 dma_dir; /* DMA data direction */
+	__u32 dma_trans_ns; /* time for DMA transmission in ns */
 	__u64 expansion[10];/* For future use */
 };

@@ -87,6 +89,10 @@ static int map_benchmark_thread(void *data)
 		map_etime = ktime_get();
 		map_delta = ktime_sub(map_etime, map_stime);

+		/* Pretend DMA is transmitting */
+		if (map->dir != DMA_FROM_DEVICE)
+			ndelay(map->bparam.dma_trans_ns);
+
 		unmap_stime = ktime_get();
 		dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);
 		unmap_etime = ktime_get();
@@ -218,6 +224,11 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
 		return -EINVAL;
 	}

+	if (map->bparam.dma_trans_ns > DMA_MAP_MAX_TRANS_DELAY) {
+		pr_err("invalid transmission delay\n");
+		return -EINVAL;
+	}
+
 	if (map->bparam.node != NUMA_NO_NODE &&
 	    !node_possible(map->bparam.node)) {
 		pr_err("invalid numa node\n");
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
index 7065163a8388..dbf426e2fb7f 100644
--- a/tools/testing/selftests/dma/dma_map_benchmark.c
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -14,6 +14,7 @@
 #define DMA_MAP_BENCHMARK	_IOWR('d', 1, struct map_benchmark)
 #define DMA_MAP_MAX_THREADS	1024
 #define DMA_MAP_MAX_SECONDS	300
+#define DMA_MAP_MAX_TRANS_DELAY	(10 * 1000 * 1000) /* 10ms */

 #define DMA_MAP_BIDIRECTIONAL	0
 #define DMA_MAP_TO_DEVICE	1
@@ -35,6 +36,7 @@ struct map_benchmark {
 	__s32 node; /* which numa node this benchmark will run on */
 	__u32 dma_bits; /* DMA addressing capability */
 	__u32 dma_dir; /* DMA data direction */
+	__u32 dma_trans_ns; /* delay for DMA transmission in ns */
 	__u64 expansion[10];/* For future use */
 };

@@ -45,12 +47,12 @@ int main(int argc, char **argv)
 	/* default single thread, run 20 seconds on NUMA_NO_NODE */
 	int threads = 1, seconds = 20, node = -1;
 	/* default dma mask 32bit, bidirectional DMA */
-	int bits = 32, dir = DMA_MAP_BIDIRECTIONAL;
+	int bits = 32, xdelay = 0, dir = DMA_MAP_BIDIRECTIONAL;
 	int cmd = DMA_MAP_BENCHMARK;
 	char *p;

-	while ((opt = getopt(argc, argv, "t:s:n:b:d:")) != -1) {
+	while ((opt = getopt(argc, argv, "t:s:n:b:d:x:")) != -1) {
 		switch (opt) {
 		case 't':
 			threads = atoi(optarg);
@@ -67,6 +69,9 @@ int main(int argc, char **argv)
 		case 'd':
 			dir = atoi(optarg);
 			break;
+		case 'x':
+			xdelay = atoi(optarg);
+			break;
 		default:
 			return -1;
 		}
@@ -84,6 +89,12 @@ int main(int argc, char **argv)
 		exit(1);
 	}

+	if (xdelay < 0 || xdelay > DMA_MAP_MAX_TRANS_DELAY) {
+		fprintf(stderr, "invalid transmit delay, must be in 0-%d\n",
+			DMA_MAP_MAX_TRANS_DELAY);
+		exit(1);
+	}
+
 	/* suppose the minimum DMA zone is 1MB in the world */
 	if (bits <
RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> -Original Message- > From: Tian, Kevin [mailto:kevin.t...@intel.com] > Sent: Tuesday, February 2, 2021 3:52 PM > To: Jason Gunthorpe > Cc: Song Bao Hua (Barry Song) ; chensihang (A) > ; Arnd Bergmann ; Greg > Kroah-Hartman ; linux-ker...@vger.kernel.org; > iommu@lists.linux-foundation.org; linux...@kvack.org; Zhangfei Gao > ; Liguozhu (Kenneth) ; > linux-accelerat...@lists.ozlabs.org > Subject: RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device > > > From: Jason Gunthorpe > > Sent: Tuesday, February 2, 2021 7:44 AM > > > > On Fri, Jan 29, 2021 at 10:09:03AM +, Tian, Kevin wrote: > > > > SVA is not doom to work with IO page fault only. If we have SVA+pin, > > > > we would get both sharing address and stable I/O latency. > > > > > > Isn't it like a traditional MAP_DMA API (imply pinning) plus specifying > > > cpu_va of the memory pool as the iova? > > > > I think their issue is the HW can't do the cpu_va trick without also > > involving the system IOMMU in a SVA mode > > > > This is the part that I didn't understand. Using cpu_va in a MAP_DMA > interface doesn't require device support. It's just an user-specified > address to be mapped into the IOMMU page table. On the other hand, The background is that uacce is based on SVA and we are building applications on uacce: https://www.kernel.org/doc/html/v5.10/misc-devices/uacce.html so IOMMU simply uses the page table of MMU, and don't do any special mapping to an user-specified address. We don't break the basic assumption that uacce is using SVA, otherwise, we need to re-build uacce and the whole base. > sharing CPU page table through a SVA interface for an usage where I/O > page faults must be completely avoided seems a misleading attempt. That is not for completely avoiding IO page fault, that is just an extension for high-performance I/O case, providing a way to avoid IO latency jitter. Using it or not is totally up to users. > Even if people do want this model (e.g. 
mix pinning+fault), it should be > a mm syscall as Greg pointed out, not specific to sva. > We are glad to make it a syscall if people are happy with it. The simplest way would be a syscall similar with userfaultfd if we don't want to mess up mm_struct. > Thanks > Kevin Thanks Barry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, February 2, 2021 12:44 PM
> To: Tian, Kevin
> Cc: Song Bao Hua (Barry Song) ; chensihang (A)
> ; Arnd Bergmann ; Greg
> Kroah-Hartman ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org; Zhangfei Gao
> ; Liguozhu (Kenneth) ;
> linux-accelerat...@lists.ozlabs.org
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
>
> On Fri, Jan 29, 2021 at 10:09:03AM +, Tian, Kevin wrote:
> > > SVA is not doom to work with IO page fault only. If we have SVA+pin,
> > > we would get both sharing address and stable I/O latency.
> >
> > Isn't it like a traditional MAP_DMA API (imply pinning) plus specifying
> > cpu_va of the memory pool as the iova?
>
> I think their issue is the HW can't do the cpu_va trick without also
> involving the system IOMMU in a SVA mode
>
> It really is something that belongs under some general /dev/sva as we
> talked on the vfio thread

AFAIK, there is no such /dev/sva yet, so /dev/uacce is a uAPI which belongs to SVA. Another option is that we add a system call like fs/userfaultfd.c, and move the file_operations and ioctl to an anon inode created via anon_inode_getfd(). Then nothing will be buried by uacce.

> Jason

Thanks
Barry
RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> -Original Message- > From: Tian, Kevin [mailto:kevin.t...@intel.com] > Sent: Friday, January 29, 2021 11:09 PM > To: Song Bao Hua (Barry Song) ; Jason Gunthorpe > > Cc: chensihang (A) ; Arnd Bergmann > ; Greg Kroah-Hartman ; > linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org; > linux...@kvack.org; Zhangfei Gao ; Liguozhu > (Kenneth) ; linux-accelerat...@lists.ozlabs.org > Subject: RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device > > > From: Song Bao Hua (Barry Song) > > Sent: Tuesday, January 26, 2021 9:27 AM > > > > > -Original Message- > > > From: Jason Gunthorpe [mailto:j...@ziepe.ca] > > > Sent: Tuesday, January 26, 2021 2:13 PM > > > To: Song Bao Hua (Barry Song) > > > Cc: Wangzhou (B) ; Greg Kroah-Hartman > > > ; Arnd Bergmann ; > > Zhangfei Gao > > > ; linux-accelerat...@lists.ozlabs.org; > > > linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org; > > > linux...@kvack.org; Liguozhu (Kenneth) ; > > chensihang > > > (A) > > > Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device > > > > > > On Mon, Jan 25, 2021 at 11:35:22PM +, Song Bao Hua (Barry Song) > > wrote: > > > > > > > > On Mon, Jan 25, 2021 at 10:21:14PM +, Song Bao Hua (Barry Song) > > wrote: > > > > > > mlock, while certainly be able to prevent swapping out, it won't > > > > > > be able to stop page moving due to: > > > > > > * memory compaction in alloc_pages() > > > > > > * making huge pages > > > > > > * numa balance > > > > > > * memory compaction in CMA > > > > > > > > > > Enabling those things is a major reason to have SVA device in the > > > > > first place, providing a SW API to turn it all off seems like the > > > > > wrong direction. > > > > > > > > I wouldn't say this is a major reason to have SVA. If we read the > > > > history of SVA and papers, people would think easy programming due > > > > to data struct sharing between cpu and device, and process space > > > > isolation in device would be the major reasons for SVA. 
SVA also > > > > declares it supports zero-copy while zero-copy doesn't necessarily > > > > depend on SVA. > > > > > > Once you have to explicitly make system calls to declare memory under > > > IO, you loose all of that. > > > > > > Since you've asked the app to be explicit about the DMAs it intends to > > > do, there is not really much reason to use SVA for those DMAs anymore. > > > > Let's see a non-SVA case. We are not using SVA, we can have > > a memory pool by hugetlb or pin, and app can allocate memory > > from this pool, and get stable I/O performance on the memory > > from the pool. But device has its separate page table which > > is not bound with this process, thus lacking the protection > > of process space isolation. Plus, CPU and device are using > > different address. > > > > And then we move to SVA case, we can still have a memory pool > > by hugetlb or pin, and app can allocate memory from this pool > > since this pool is mapped to the address space of the process, > > and we are able to get stable I/O performance since it is always > > there. But in this case, device is using the page table of > > process with the full permission control. > > And they are using same address and can possibly enjoy the easy > > programming if HW supports. > > > > SVA is not doom to work with IO page fault only. If we have SVA+pin, > > we would get both sharing address and stable I/O latency. > > > > Isn't it like a traditional MAP_DMA API (imply pinning) plus specifying > cpu_va of the memory pool as the iova? I think it enjoys the advantage of stable I/O latency of traditional MAP_DMA, and also uses the process page table which SVA can provide. The major difference is that in SVA case, iova totally belongs to process and is as normal as other heap/stack/data: p = mmap(.MAP_ANON); ioctl(/dev/acc, p, PIN); SVA for itself, provides the ability to guarantee the address space isolation of multiple processes. 
If the device can access the data struct such as list, tree directly, they can further enjoy the convenience of programming SVA gives. So we are looking for a combination of stable io latency of traditional DMA map and the ability of SVA. > > Thanks > Kevin Thanks Barry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> -Original Message- > From: Jason Gunthorpe [mailto:j...@ziepe.ca] > Sent: Wednesday, January 27, 2021 7:20 AM > To: Song Bao Hua (Barry Song) > Cc: Wangzhou (B) ; Greg Kroah-Hartman > ; Arnd Bergmann ; Zhangfei Gao > ; linux-accelerat...@lists.ozlabs.org; > linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org; > linux...@kvack.org; Liguozhu (Kenneth) ; chensihang > (A) > Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device > > On Tue, Jan 26, 2021 at 01:26:45AM +, Song Bao Hua (Barry Song) wrote: > > > On Mon, Jan 25, 2021 at 11:35:22PM +, Song Bao Hua (Barry Song) wrote: > > > > > > > > On Mon, Jan 25, 2021 at 10:21:14PM +, Song Bao Hua (Barry Song) > wrote: > > > > > > mlock, while certainly be able to prevent swapping out, it won't > > > > > > be able to stop page moving due to: > > > > > > * memory compaction in alloc_pages() > > > > > > * making huge pages > > > > > > * numa balance > > > > > > * memory compaction in CMA > > > > > > > > > > Enabling those things is a major reason to have SVA device in the > > > > > first place, providing a SW API to turn it all off seems like the > > > > > wrong direction. > > > > > > > > I wouldn't say this is a major reason to have SVA. If we read the > > > > history of SVA and papers, people would think easy programming due > > > > to data struct sharing between cpu and device, and process space > > > > isolation in device would be the major reasons for SVA. SVA also > > > > declares it supports zero-copy while zero-copy doesn't necessarily > > > > depend on SVA. > > > > > > Once you have to explicitly make system calls to declare memory under > > > IO, you loose all of that. > > > > > > Since you've asked the app to be explicit about the DMAs it intends to > > > do, there is not really much reason to use SVA for those DMAs anymore. > > > > Let's see a non-SVA case. 
We are not using SVA, we can have > > a memory pool by hugetlb or pin, and app can allocate memory > > from this pool, and get stable I/O performance on the memory > > from the pool. But device has its separate page table which > > is not bound with this process, thus lacking the protection > > of process space isolation. Plus, CPU and device are using > > different address. > > So you are relying on the platform to do the SVA for the device? > Sorry for late response. uacce and its userspace framework UADK depend on SVA, leveraging the enhanced security by isolated process address space. This patch is mainly an extension for performance optimization to get stable high-performance I/O on pinned memory even though the hardware supports IO page fault to get pages back after swapping out or page migration. But IO page fault will cause serious latency jitter for high-speed I/O. For slow speed device, they don't need to use this extension. > This feels like it goes back to another topic where I felt the SVA > setup uAPI should be shared and not buried into every driver's unique > ioctls. > > Having something like this in a shared SVA system is somewhat less > strange. Sounds reasonable. On the other hand, uacce seems to be an common uAPI for SVA, and probably the only one for this moment. uacce is a framework not a specific driver as any accelerators can hook into this framework as long as a device provides uacce_ops and register itself by uacce_register(). Uacce, for itself, doesn't bind with any specific hardware. So uacce interfaces are kind of common uAPI :-) > > Jason Thanks Barry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> -Original Message- > From: Jason Gunthorpe [mailto:j...@ziepe.ca] > Sent: Tuesday, January 26, 2021 2:13 PM > To: Song Bao Hua (Barry Song) > Cc: Wangzhou (B) ; Greg Kroah-Hartman > ; Arnd Bergmann ; Zhangfei Gao > ; linux-accelerat...@lists.ozlabs.org; > linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org; > linux...@kvack.org; Liguozhu (Kenneth) ; chensihang > (A) > Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device > > On Mon, Jan 25, 2021 at 11:35:22PM +, Song Bao Hua (Barry Song) wrote: > > > > On Mon, Jan 25, 2021 at 10:21:14PM +, Song Bao Hua (Barry Song) wrote: > > > > mlock, while certainly be able to prevent swapping out, it won't > > > > be able to stop page moving due to: > > > > * memory compaction in alloc_pages() > > > > * making huge pages > > > > * numa balance > > > > * memory compaction in CMA > > > > > > Enabling those things is a major reason to have SVA device in the > > > first place, providing a SW API to turn it all off seems like the > > > wrong direction. > > > > I wouldn't say this is a major reason to have SVA. If we read the > > history of SVA and papers, people would think easy programming due > > to data struct sharing between cpu and device, and process space > > isolation in device would be the major reasons for SVA. SVA also > > declares it supports zero-copy while zero-copy doesn't necessarily > > depend on SVA. > > Once you have to explicitly make system calls to declare memory under > IO, you loose all of that. > > Since you've asked the app to be explicit about the DMAs it intends to > do, there is not really much reason to use SVA for those DMAs anymore. Let's see a non-SVA case. We are not using SVA, we can have a memory pool by hugetlb or pin, and app can allocate memory from this pool, and get stable I/O performance on the memory from the pool. But device has its separate page table which is not bound with this process, thus lacking the protection of process space isolation. 
Plus, CPU and device are using different addresses.

And then we move to the SVA case: we can still have a memory pool backed by hugetlb or pinning, and the app can allocate memory from this pool since the pool is mapped into the address space of the process, and we are able to get stable I/O performance since it is always there. But in this case, the device is using the page table of the process with full permission control. And they are using the same address, and can possibly enjoy easy programming if the HW supports it.

SVA is not doomed to work with IO page faults only. If we have SVA+pin, we would get both address sharing and stable I/O latency.

>
> Jason

Thanks
Barry
RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> -----Original Message-----
> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of Jason Gunthorpe
> Sent: Tuesday, January 26, 2021 12:16 PM
> To: Song Bao Hua (Barry Song)
> Cc: Wangzhou (B); Greg Kroah-Hartman; Arnd Bergmann; Zhangfei Gao; linux-accelerat...@lists.ozlabs.org; linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org; linux...@kvack.org; Liguozhu (Kenneth); chensihang (A)
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
>
> On Mon, Jan 25, 2021 at 10:21:14PM +, Song Bao Hua (Barry Song) wrote:
> > mlock, while certainly be able to prevent swapping out, it won't
> > be able to stop page moving due to:
> > * memory compaction in alloc_pages()
> > * making huge pages
> > * numa balance
> > * memory compaction in CMA
>
> Enabling those things is a major reason to have SVA device in the
> first place, providing a SW API to turn it all off seems like the
> wrong direction.

I wouldn't say this is a major reason to have SVA. If we read the history of SVA and the papers, people would name easy programming (due to data structure sharing between CPU and device) and process space isolation in the device as the major reasons for SVA. SVA also claims to support zero-copy, while zero-copy doesn't necessarily depend on SVA.

Page migration and I/O page fault overhead, on the other hand, are probably the major problems blocking SVA from becoming a high-performance and more popular solution.

> If the device doesn't want to use SVA then don't use it, use normal
> DMA pinning like everything else.

If we disable SVA, we won't get SVA's benefits of address sharing and process space isolation.

> Jason

Thanks
Barry
RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> -Original Message- > From: Jason Gunthorpe [mailto:j...@ziepe.ca] > Sent: Tuesday, January 26, 2021 4:47 AM > To: Wangzhou (B) > Cc: Greg Kroah-Hartman ; Arnd Bergmann > ; Zhangfei Gao ; > linux-accelerat...@lists.ozlabs.org; linux-ker...@vger.kernel.org; > iommu@lists.linux-foundation.org; linux...@kvack.org; Song Bao Hua (Barry > Song) > ; Liguozhu (Kenneth) ; > chensihang (A) > Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device > > On Mon, Jan 25, 2021 at 04:34:56PM +0800, Zhou Wang wrote: > > > +static int uacce_pin_page(struct uacce_pin_container *priv, > > + struct uacce_pin_address *addr) > > +{ > > + unsigned int flags = FOLL_FORCE | FOLL_WRITE; > > + unsigned long first, last, nr_pages; > > + struct page **pages; > > + struct pin_pages *p; > > + int ret; > > + > > + first = (addr->addr & PAGE_MASK) >> PAGE_SHIFT; > > + last = ((addr->addr + addr->size - 1) & PAGE_MASK) >> PAGE_SHIFT; > > + nr_pages = last - first + 1; > > + > > + pages = vmalloc(nr_pages * sizeof(struct page *)); > > + if (!pages) > > + return -ENOMEM; > > + > > + p = kzalloc(sizeof(*p), GFP_KERNEL); > > + if (!p) { > > + ret = -ENOMEM; > > + goto free; > > + } > > + > > + ret = pin_user_pages_fast(addr->addr & PAGE_MASK, nr_pages, > > + flags | FOLL_LONGTERM, pages); > > This needs to copy the RLIMIT_MEMLOCK and can_do_mlock() stuff from > other places, like ib_umem_get > > > + ret = xa_err(xa_store(&priv->array, p->first, p, GFP_KERNEL)); > > And this is really weird, I don't think it makes sense to make handles > for DMA based on the starting VA. 
> > +static int uacce_unpin_page(struct uacce_pin_container *priv,
> > +			    struct uacce_pin_address *addr)
> > +{
> > +	unsigned long first, last, nr_pages;
> > +	struct pin_pages *p;
> > +
> > +	first = (addr->addr & PAGE_MASK) >> PAGE_SHIFT;
> > +	last = ((addr->addr + addr->size - 1) & PAGE_MASK) >> PAGE_SHIFT;
> > +	nr_pages = last - first + 1;
> > +
> > +	/* find pin_pages */
> > +	p = xa_load(&priv->array, first);
> > +	if (!p)
> > +		return -ENODEV;
> > +
> > +	if (p->nr_pages != nr_pages)
> > +		return -EINVAL;
> > +
> > +	/* unpin */
> > +	unpin_user_pages(p->pages, p->nr_pages);
>
> And unpinning without guaranteeing there is no ongoing DMA is really
> weird

In the SVA case, the kernel has no idea whether accelerators are accessing the memory, so I would assume SVA has a method to prevent the pages from being migrated or released. Otherwise, SVA would crash easily in a system with high memory pressure. Anyway, this is a problem worth further investigation.

> Are you abusing this in conjunction with a SVA scheme just to prevent
> page motion? Why wasn't mlock good enough?

Page migration won't cause any dysfunction in the SVA case, as an IO page fault will get a valid page again. It is only a performance issue: an IO page fault has a larger latency than the usual page fault, roughly 3-80x slower [1].

mlock, while certainly able to prevent swapping out, won't be able to stop page moving due to:
* memory compaction in alloc_pages()
* making huge pages
* numa balance
* memory compaction in CMA
etc.

[1] https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7482091&tag=1

> Jason

Thanks
Barry
[PATCH RESEND] dma-mapping: benchmark: fix kernel crash when dma_map_single fails
If dma_map_single() fails, the kernel gives the oops below, since the task_struct has been destroyed and we run into memory corruption due to use-after-free in kthread_stop():

[ 48.095310] Unable to handle kernel paging request at virtual address 00c473548040
[ 48.095736] Mem abort info:
[ 48.095864] ESR = 0x9604
[ 48.096025] EC = 0x25: DABT (current EL), IL = 32 bits
[ 48.096268] SET = 0, FnV = 0
[ 48.096401] EA = 0, S1PTW = 0
[ 48.096538] Data abort info:
[ 48.096659] ISV = 0, ISS = 0x0004
[ 48.096820] CM = 0, WnR = 0
[ 48.097079] user pgtable: 4k pages, 48-bit VAs, pgdp=000104639000
[ 48.098099] [00c473548040] pgd=, p4d=
[ 48.098832] Internal error: Oops: 9604 [#1] PREEMPT SMP
[ 48.099232] Modules linked in:
[ 48.099387] CPU: 0 PID: 2 Comm: kthreadd Tainted: GW
[ 48.099887] Hardware name: linux,dummy-virt (DT)
[ 48.100078] pstate: 6005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[ 48.100516] pc : __kmalloc_node+0x214/0x368
[ 48.100944] lr : __kmalloc_node+0x1f4/0x368
[ 48.101458] sp : 800011f0bb80
[ 48.101843] x29: 800011f0bb80 x28: c0098ec0
[ 48.102330] x27: x26: 001d4600
[ 48.102648] x25: c0098ec0 x24: 800011b6a000
[ 48.102988] x23: x22: c0098ec0
[ 48.10] x21: 8000101d7a54 x20: 0dc0
[ 48.103657] x19: c0001e00 x18:
[ 48.104069] x17: x16:
[ 48.105449] x15: 01aa0304e7b9 x14: 03b1
[ 48.106401] x13: 8000122d5000 x12: 80001228d000
[ 48.107296] x11: c0154340 x10:
[ 48.107862] x9 : 8fff x8 : c473527f
[ 48.108326] x7 : 800011e62f58 x6 : c01c8ed8
[ 48.108778] x5 : c0098ec0 x4 :
[ 48.109223] x3 : 001d4600 x2 : 0040
[ 48.109656] x1 : 0001 x0 : ffc473548000
[ 48.110104] Call trace:
[ 48.110287] __kmalloc_node+0x214/0x368
[ 48.110493] __vmalloc_node_range+0xc4/0x298
[ 48.110805] copy_process+0x2c8/0x15c8
[ 48.33] kernel_clone+0x5c/0x3c0
[ 48.111373] kernel_thread+0x64/0x90
[ 48.111604] kthreadd+0x158/0x368
[ 48.111810] ret_from_fork+0x10/0x30
[ 48.112336] Code: 17e9 b9402a62 b94008a1 11000421 (f8626802)
[ 48.112884] ---[ end trace d4890e21e75419d5 ]---

Signed-off-by: Barry
Song
---
 kernel/dma/map_benchmark.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index b1496e744c68..1b1b8ff875cb 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -147,8 +147,10 @@ static int do_map_benchmark(struct map_benchmark_data *map)
 	atomic64_set(&map->sum_sq_unmap, 0);
 	atomic64_set(&map->loops, 0);

-	for (i = 0; i < threads; i++)
+	for (i = 0; i < threads; i++) {
+		get_task_struct(tsk[i]);
 		wake_up_process(tsk[i]);
+	}

 	msleep_interruptible(map->bparam.seconds * 1000);

@@ -183,6 +185,8 @@ static int do_map_benchmark(struct map_benchmark_data *map)
 	}

 out:
+	for (i = 0; i < threads; i++)
+		put_task_struct(tsk[i]);
 	put_device(map->dev);
 	kfree(tsk);
 	return ret;
--
2.25.1
RE: [PATCH] dma-mapping: benchmark: check the validity of dma mask bits
> -----Original Message-----
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Saturday, December 19, 2020 7:10 AM
> To: Song Bao Hua (Barry Song); h...@lst.de; m.szyprow...@samsung.com
> Cc: iommu@lists.linux-foundation.org; Linuxarm; Dan Carpenter
> Subject: Re: [PATCH] dma-mapping: benchmark: check the validity of dma mask bits
>
> On 2020-12-12 10:18, Barry Song wrote:
> > When dma_mask_bits is larger than 64, the behaviour is undefined. On the
> > other hand, a dma_mask_bits smaller than 20 (1MB) makes no sense
> > in real hardware.
> >
> > Reported-by: Dan Carpenter
> > Signed-off-by: Barry Song
> > ---
> >  kernel/dma/map_benchmark.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> > index b1496e744c68..19f661692073 100644
> > --- a/kernel/dma/map_benchmark.c
> > +++ b/kernel/dma/map_benchmark.c
> > @@ -214,6 +214,12 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
> >  		return -EINVAL;
> >  	}
> >
> > +	if (map->bparam.dma_bits < 20 ||
>
> FWIW I don't think we need to bother with a lower limit here - it's
> unsigned, and a pointlessly small value will fail gracefully when we
> come to actually set the mask anyway. We only need to protect kernel
> code from going wrong, not userspace from being stupid to its own detriment.

I am not sure a kernel driver can reject a small dma mask bit if it doesn't handle it properly. About a month ago, when I was debugging dma map benchmark, I set a value less than 32 on devices behind arm-smmu-v3: setting the mask always succeeded, but dma_map_single() was always failing. I didn't debug that issue at the time, and I am not sure about the latest status of the iommu driver.

drivers/iommu/intel/iommu.c used to have a dma_supported() hook to reject a small dma_mask:

static const struct dma_map_ops bounce_dma_ops = {
	...
	.dma_supported = dma_direct_supported,
};

>
> Robin.
> > +	    map->bparam.dma_bits > 64) {
> > +		pr_err("invalid dma_bits\n");
> > +		return -EINVAL;
> > +	}
> > +
> >  	if (map->bparam.node != NUMA_NO_NODE &&
> >  	    !node_possible(map->bparam.node)) {
> >  		pr_err("invalid numa node\n");

Thanks
Barry
RE: [PATCH v2] dma-mapping: add unlikely hint for error path in dma_mapping_error
> -----Original Message-----
> From: Heiner Kallweit [mailto:hkallwe...@gmail.com]
> Sent: Monday, December 14, 2020 5:33 AM
> To: Christoph Hellwig; Marek Szyprowski; Robin Murphy; Song Bao Hua (Barry Song)
> Cc: open list:AMD IOMMU (AMD-VI); Linux Kernel Mailing List
> Subject: [PATCH v2] dma-mapping: add unlikely hint for error path in dma_mapping_error
>
> Zillions of drivers use the unlikely() hint when checking the result of
> dma_mapping_error(). This is an inline function anyway, so we can move
> the hint into this function and remove it from drivers.
>
> Signed-off-by: Heiner Kallweit

I am not sure this is really necessary. The original code seems more readable: readers can more easily understand that we are predicting the branch based on the return value of dma_mapping_error(). Anyway, I don't object to this one; if other people like it, I am also OK with it.

> ---
> v2:
> Split the big patch into the change for dma-mapping.h and follow-up
> patches per subsystem that will go through the trees of the respective
> maintainers.
> ---
>  include/linux/dma-mapping.h | 2 +-
>  kernel/dma/map_benchmark.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 2e49996a8..6177e20b5 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -95,7 +95,7 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
>  {
>  	debug_dma_mapping_error(dev, dma_addr);
>
> -	if (dma_addr == DMA_MAPPING_ERROR)
> +	if (unlikely(dma_addr == DMA_MAPPING_ERROR))
>  		return -ENOMEM;
>  	return 0;
>  }
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index b1496e744..901420a5d 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -78,7 +78,7 @@ static int map_benchmark_thread(void *data)
>
>  	map_stime = ktime_get();
>  	dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, map->dir);
> -	if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
> +	if (dma_mapping_error(map->dev, dma_addr)) {
>  		pr_err("dma_map_single failed on %s\n",
>  			dev_name(map->dev));
>  		ret = -ENOMEM;
> --
> 2.29.2

Thanks
Barry
[PATCH] dma-mapping: benchmark: check the validity of dma mask bits
When dma_mask_bits is larger than 64, the behaviour is undefined. On the other hand, a dma_mask_bits smaller than 20 (1MB) makes no sense in real hardware.

Reported-by: Dan Carpenter
Signed-off-by: Barry Song
---
 kernel/dma/map_benchmark.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index b1496e744c68..19f661692073 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -214,6 +214,12 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
 		return -EINVAL;
 	}

+	if (map->bparam.dma_bits < 20 ||
+	    map->bparam.dma_bits > 64) {
+		pr_err("invalid dma_bits\n");
+		return -EINVAL;
+	}
+
 	if (map->bparam.node != NUMA_NO_NODE &&
 	    !node_possible(map->bparam.node)) {
 		pr_err("invalid numa node\n");
--
2.25.1
RE: [bug report] dma-mapping: add benchmark support for streaming DMA APIs
> -Original Message- > From: Dan Carpenter [mailto:dan.carpen...@oracle.com] > Sent: Wednesday, December 9, 2020 8:00 PM > To: Song Bao Hua (Barry Song) > Cc: iommu@lists.linux-foundation.org > Subject: [bug report] dma-mapping: add benchmark support for streaming DMA > APIs > > Hello Barry Song, > > The patch 65789daa8087: "dma-mapping: add benchmark support for > streaming DMA APIs" from Nov 16, 2020, leads to the following static > checker warning: > > kernel/dma/map_benchmark.c:241 map_benchmark_ioctl() > error: undefined (user controlled) shift '1 << (map->bparam.dma_bits)' > > kernel/dma/map_benchmark.c >191 static long map_benchmark_ioctl(struct file *file, unsigned int cmd, >192 unsigned long arg) >193 { >194 struct map_benchmark_data *map = file->private_data; >195 void __user *argp = (void __user *)arg; >196 u64 old_dma_mask; >197 >198 int ret; >199 >200 if (copy_from_user(&map->bparam, argp, sizeof(map->bparam))) >^ > Comes from the user > >201 return -EFAULT; >202 >203 switch (cmd) { >204 case DMA_MAP_BENCHMARK: >205 if (map->bparam.threads == 0 || >206 map->bparam.threads > DMA_MAP_MAX_THREADS) { >207 pr_err("invalid thread number\n"); >208 return -EINVAL; >209 } >210 >211 if (map->bparam.seconds == 0 || >212 map->bparam.seconds > DMA_MAP_MAX_SECONDS) { >213 pr_err("invalid duration seconds\n"); >214 return -EINVAL; >215 } >216 >217 if (map->bparam.node != NUMA_NO_NODE && >218 !node_possible(map->bparam.node)) { >219 pr_err("invalid numa node\n"); >220 return -EINVAL; >221 } >222 >223 switch (map->bparam.dma_dir) { >224 case DMA_MAP_BIDIRECTIONAL: >225 map->dir = DMA_BIDIRECTIONAL; >226 break; >227 case DMA_MAP_FROM_DEVICE: >228 map->dir = DMA_FROM_DEVICE; >229 break; >230 case DMA_MAP_TO_DEVICE: >231 map->dir = DMA_TO_DEVICE; >232 break; >233 default: >234 pr_err("invalid DMA direction\n"); >235 return -EINVAL; >236 } >237 >238 old_dma_mask = dma_get_mask(map->dev); >239 >240 ret = dma_set_mask(map->dev, >241 > DMA_BIT_MASK(map->bparam.dma_bits)); >^^ 
> If this is more than 31 then the behavior is undefined (but in real life
> it will shift wrap).

Guess it should be no more than 64? For 64, DMA_BIT_MASK() gives ~0ULL; otherwise, it gives (1ULL << bits) - 1.

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=7679325702

I have some code like:

+	/* suppose the minimum DMA zone is 1MB in the world */
+	if (bits < 20 || bits > 64) {
+		fprintf(stderr, "invalid dma mask bit, must be in 20-64\n");
+		exit(1);
+	}

Maybe I should do the same thing in the kernel as well.

>
>242 		if (ret) {
>243 			pr_err("failed to set dma_mask on device %s\n",
>244 				dev_name(map->dev));
>245 			return -EINVAL;
>246 		}
>247
>248 		ret = do_map_benchmark(map);
>249
>250 		/*
>251 		 * restore the original dma_mask as many devices' dma_mask are
>252 		 * set by architectures, acpi, busses. When we bind them back
>253 		 * to their original drivers, those drivers shouldn't see
>254 		 * dma_mask changed by benchmark
>255 		 */
>256 		dma_set_mask(map->dev, old_dma_mask);
>257
RE: [PATCH] dma-mapping: Fix sizeof() mismatch on tsk allocation
> -----Original Message-----
> From: Colin King [mailto:colin.k...@canonical.com]
> Sent: Thursday, November 26, 2020 3:05 AM
> To: Song Bao Hua (Barry Song); Christoph Hellwig; Marek Szyprowski; Robin Murphy; iommu@lists.linux-foundation.org
> Cc: kernel-janit...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] dma-mapping: Fix sizeof() mismatch on tsk allocation
>
> From: Colin Ian King
>
> An incorrect sizeof() is being used, sizeof(tsk) is not correct, it should
> be sizeof(*tsk). Fix it.
>
> Addresses-Coverity: ("Sizeof not portable (SIZEOF_MISMATCH)")
> Fixes: bfd2defed94d ("dma-mapping: add benchmark support for streaming DMA APIs")
> Signed-off-by: Colin Ian King
> ---
>  kernel/dma/map_benchmark.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index e1e37603d01b..b1496e744c68 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -121,7 +121,7 @@ static int do_map_benchmark(struct map_benchmark_data *map)
>  	int ret = 0;
>  	int i;
>
> -	tsk = kmalloc_array(threads, sizeof(tsk), GFP_KERNEL);
> +	tsk = kmalloc_array(threads, sizeof(*tsk), GFP_KERNEL);

The allocated size happens to be the same, but the change is correct.

Acked-by: Barry Song

>  	if (!tsk)
>  		return -ENOMEM;
>

Thanks
Barry
RE: [PATCH] dma-mapping: fix an uninitialized pointer read due to typo in argp assignment
> -----Original Message-----
> From: Colin King [mailto:colin.k...@canonical.com]
> Sent: Thursday, November 26, 2020 2:56 AM
> To: Song Bao Hua (Barry Song); Christoph Hellwig; Marek Szyprowski; Robin Murphy; iommu@lists.linux-foundation.org
> Cc: kernel-janit...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] dma-mapping: fix an uninitialized pointer read due to typo in argp assignment
>
> From: Colin Ian King
>
> The assignment of argp is currently using argp as the source because of
> a typo. Fix this by assigning it the value passed in arg instead.
>
> Addresses-Coverity: ("Uninitialized pointer read")
> Fixes: bfd2defed94d ("dma-mapping: add benchmark support for streaming DMA APIs")
> Signed-off-by: Colin Ian King

Acked-by: Barry Song

> ---
>  kernel/dma/map_benchmark.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index ca616b664f72..e1e37603d01b 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -192,7 +192,7 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
>  		unsigned long arg)
>  {
>  	struct map_benchmark_data *map = file->private_data;
> -	void __user *argp = (void __user *)argp;
> +	void __user *argp = (void __user *)arg;
>  	u64 old_dma_mask;
>
>  	int ret;
> --
> 2.29.2

Thanks
Barry
[PATCH v4 1/2] dma-mapping: add benchmark support for streaming DMA APIs
Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap, particularly while the device is attached to an IOMMU. This patch enables that support. Users can run a specified number of threads doing dma_map_page and dma_unmap_page on a specific NUMA node for a specified duration. Then dma_map_benchmark will calculate the average latency for map and unmap.

A difficulty for this benchmark is that the dma_map/unmap APIs must run on a particular device. Each device might have a different backend, IOMMU or non-IOMMU. So we use driver_override to bind dma_map_benchmark to a particular device by:

For platform devices:
echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind

For PCI devices:
echo dma_map_benchmark > /sys/bus/pci/devices/:00:01.0/driver_override
echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind
echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind

Cc: Will Deacon
Cc: Shuah Khan
Cc: Christoph Hellwig
Cc: Marek Szyprowski
Cc: Robin Murphy
Signed-off-by: Barry Song
---
-v4:
 * add dma direction support according to Christoph Hellwig's comment;
 * add dma mask bit set according to Christoph Hellwig's comment;
 * make the benchmark depend on DEBUG_FS according to John Garry's comment;
 * strictly check parameters in ioctl;
 * fixed more than 80 char in one line;

 kernel/dma/Kconfig | 9 +
 kernel/dma/Makefile| 1 +
 kernel/dma/map_benchmark.c | 361 +
 3 files changed, 371 insertions(+)
 create mode 100644 kernel/dma/map_benchmark.c

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index c99de4a21458..07f30651b83d 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -225,3 +225,12 @@ config DMA_API_DEBUG_SG
	  is technically out-of-spec.

	  If unsure, say N.
+ +config DMA_MAP_BENCHMARK + bool "Enable benchmarking of streaming DMA mapping" + depends on DEBUG_FS + help + Provides /sys/kernel/debug/dma_map_benchmark that helps with testing + performance of dma_(un)map_page. + + See tools/testing/selftests/dma/dma_map_benchmark.c diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile index dc755ab68aab..7aa6b26b1348 100644 --- a/kernel/dma/Makefile +++ b/kernel/dma/Makefile @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG) += debug.o obj-$(CONFIG_SWIOTLB) += swiotlb.o obj-$(CONFIG_DMA_COHERENT_POOL)+= pool.o obj-$(CONFIG_DMA_REMAP)+= remap.o +obj-$(CONFIG_DMA_MAP_BENCHMARK)+= map_benchmark.o diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c new file mode 100644 index ..41d44a75adb2 --- /dev/null +++ b/kernel/dma/map_benchmark.c @@ -0,0 +1,361 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2020 Hisilicon Limited. + */ + +#define pr_fmt(fmt)KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark) +#define DMA_MAP_MAX_THREADS1024 +#define DMA_MAP_MAX_SECONDS300 + +#define DMA_MAP_BIDIRECTIONAL 0 +#define DMA_MAP_TO_DEVICE 1 +#define DMA_MAP_FROM_DEVICE2 + +struct map_benchmark { + __u64 avg_map_100ns; /* average map latency in 100ns */ + __u64 map_stddev; /* standard deviation of map latency */ + __u64 avg_unmap_100ns; /* as above */ + __u64 unmap_stddev; + __u32 threads; /* how many threads will do map/unmap in parallel */ + __u32 seconds; /* how long the test will last */ + __s32 node; /* which numa node this benchmark will run on */ + __u32 dma_bits; /* DMA addressing capability */ + __u32 dma_dir; /* DMA data direction */ + __u64 expansion[10];/* For future use */ +}; + +struct map_benchmark_data { + struct map_benchmark bparam; + struct device *dev; + struct dentry *debugfs; + enum dma_data_direction dir; + atomic64_t sum_map_100ns; + 
atomic64_t sum_unmap_100ns; + atomic64_t sum_sq_map; + atomic64_t sum_sq_unmap; + atomic64_t loops; +}; + +static int map_benchmark_thread(void *data) +{ + void *buf; + dma_addr_t dma_addr; + struct map_benchmark_data *map = data; + int ret = 0; + + buf = (void *)__get_free_page(GFP_KERNEL); + if (!buf) + return -ENOMEM; + + while (!kthread_should_stop()) { + u64 map_100ns, unmap_100ns, map_sq, unmap_sq; + ktime_t map_stime, map_etime, unmap_stime, unmap_etime; + ktime_t map_delta, unmap_delta; + + /* +* for a non-coherent device, if we don&
[PATCH v4 0/2] dma-mapping: provide a benchmark for streaming DMA mapping
Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap, particularly while the device is attached to an IOMMU. This patchset provides the benchmark infrastructure for streaming DMA mapping.

The architecture of the code is pretty much similar to the GUP benchmark:
* mm/gup_benchmark.c provides the kernel interface;
* tools/testing/selftests/vm/gup_benchmark.c provides the user program to call the interface provided by mm/gup_benchmark.c.

In our case, kernel/dma/map_benchmark.c is like mm/gup_benchmark.c; tools/testing/selftests/dma/dma_map_benchmark.c is like tools/testing/selftests/vm/gup_benchmark.c.

A major difference from the GUP benchmark is that the DMA_MAP benchmark needs to run on a device. Consider one board with the below devices and IOMMUs:
device A --- IOMMU 1
device B --- IOMMU 2
device C --- non-IOMMU

Different devices might be attached to different IOMMUs or to no IOMMU. To make the benchmark run, we can either:
* create a virtual device and hack the kernel code to attach the virtual device to IOMMU1, IOMMU2 or non-IOMMU;
* use the existing driver_override mechanism: unbind device A, B, or C from its original driver and bind it to the dma_map_benchmark platform driver or pci driver for benchmarking.

In this patchset, I prefer to use driver_override and avoid the ugly hack in the kernel. We can dynamically switch devices behind different IOMMUs to get the performance of IOMMU or non-IOMMU.

-v4:
 * add dma direction support according to Christoph Hellwig's comment;
 * add dma mask bit set according to Christoph Hellwig's comment;
 * make the benchmark depend on DEBUG_FS according to John Garry's comment;
 * strictly check parameters in ioctl
-v3:
 * fix build issues reported by 0day kernel test robot
-v2:
 * add PCI support; v1 supported platform devices only
 * replace ssleep by msleep_interruptible() to permit users to exit benchmark before it is completed
 * many changes according to Robin's suggestions, thanks!
Robin
 - add standard deviation output to reflect the worst case
 - check users' parameters strictly like the number of threads
 - make cache dirty before dma_map
 - fix unpaired dma_map_page and dma_unmap_single;
 - remove redundant "long long" before ktime_to_ns();
 - use devm_add_action()

Barry Song (2):
  dma-mapping: add benchmark support for streaming DMA APIs
  selftests/dma: add test application for DMA_MAP_BENCHMARK

 MAINTAINERS | 6 +
 kernel/dma/Kconfig | 9 +
 kernel/dma/Makefile | 1 +
 kernel/dma/map_benchmark.c | 361 ++
 tools/testing/selftests/dma/Makefile | 6 +
 tools/testing/selftests/dma/config | 1 +
 .../testing/selftests/dma/dma_map_benchmark.c | 123 ++
 7 files changed, 507 insertions(+)
 create mode 100644 kernel/dma/map_benchmark.c
 create mode 100644 tools/testing/selftests/dma/Makefile
 create mode 100644 tools/testing/selftests/dma/config
 create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c
--
2.25.1
[PATCH v4 2/2] selftests/dma: add test application for DMA_MAP_BENCHMARK
This patch provides the test application for DMA_MAP_BENCHMARK.

Before running the test application, we need to bind a device to the dma_map_benchmark driver. For example, unbind "xxx" from its original driver and bind to dma_map_benchmark:

echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind

Another example for PCI devices:

echo dma_map_benchmark > /sys/bus/pci/devices/:00:01.0/driver_override
echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind
echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind

The below command will run 16 threads on numa node 0 for 10 seconds on the device bound to the dma_map_benchmark platform_driver or pci_driver:

./dma_map_benchmark -t 16 -s 10 -n 0
dma mapping benchmark: threads:16 seconds:10
average map latency(us):1.1 standard deviation:1.9
average unmap latency(us):0.5 standard deviation:0.8

Cc: Will Deacon
Cc: Shuah Khan
Cc: Christoph Hellwig
Cc: Marek Szyprowski
Cc: Robin Murphy
Signed-off-by: Barry Song
---
-v4:
 * add dma direction and mask_bit parameters

 MAINTAINERS | 6 +
 tools/testing/selftests/dma/Makefile | 6 +
 tools/testing/selftests/dma/config | 1 +
 .../testing/selftests/dma/dma_map_benchmark.c | 123 ++
 4 files changed, 136 insertions(+)
 create mode 100644 tools/testing/selftests/dma/Makefile
 create mode 100644 tools/testing/selftests/dma/config
 create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c

diff --git a/MAINTAINERS b/MAINTAINERS
index e451dcce054f..bc851ffd3114 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5247,6 +5247,12 @@ F: include/linux/dma-mapping.h
 F:	include/linux/dma-map-ops.h
 F:	kernel/dma/

+DMA MAPPING BENCHMARK
+M:	Barry Song
+L:	iommu@lists.linux-foundation.org
+F:	kernel/dma/map_benchmark.c
+F:	tools/testing/selftests/dma/
+
 DMA-BUF HEAPS FRAMEWORK
 M:	Sumit Semwal
 R:	Benjamin Gaignard

diff --git a/tools/testing/selftests/dma/Makefile b/tools/testing/selftests/dma/Makefile
new file mode 100644 index ..aa8e8b5b3864 --- /dev/null +++ b/tools/testing/selftests/dma/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 +CFLAGS += -I../../../../usr/include/ + +TEST_GEN_PROGS := dma_map_benchmark + +include ../lib.mk diff --git a/tools/testing/selftests/dma/config b/tools/testing/selftests/dma/config new file mode 100644 index ..6102ee3c43cd --- /dev/null +++ b/tools/testing/selftests/dma/config @@ -0,0 +1 @@ +CONFIG_DMA_MAP_BENCHMARK=y diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c new file mode 100644 index ..7065163a8388 --- /dev/null +++ b/tools/testing/selftests/dma/dma_map_benchmark.c @@ -0,0 +1,123 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2020 Hisilicon Limited. + */ + +#include +#include +#include +#include +#include +#include +#include + +#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark) +#define DMA_MAP_MAX_THREADS1024 +#define DMA_MAP_MAX_SECONDS 300 + +#define DMA_MAP_BIDIRECTIONAL 0 +#define DMA_MAP_TO_DEVICE 1 +#define DMA_MAP_FROM_DEVICE2 + +static char *directions[] = { + "BIDIRECTIONAL", + "TO_DEVICE", + "FROM_DEVICE", +}; + +struct map_benchmark { + __u64 avg_map_100ns; /* average map latency in 100ns */ + __u64 map_stddev; /* standard deviation of map latency */ + __u64 avg_unmap_100ns; /* as above */ + __u64 unmap_stddev; + __u32 threads; /* how many threads will do map/unmap in parallel */ + __u32 seconds; /* how long the test will last */ + __s32 node; /* which numa node this benchmark will run on */ + __u32 dma_bits; /* DMA addressing capability */ + __u32 dma_dir; /* DMA data direction */ + __u64 expansion[10];/* For future use */ +}; + +int main(int argc, char **argv) +{ + struct map_benchmark map; + int fd, opt; + /* default single thread, run 20 seconds on NUMA_NO_NODE */ + int threads = 1, seconds = 20, node = -1; + /* default dma mask 32bit, bidirectional DMA */ + int bits = 32, dir = 
DMA_MAP_BIDIRECTIONAL; + + int cmd = DMA_MAP_BENCHMARK; + char *p; + + while ((opt = getopt(argc, argv, "t:s:n:b:d:")) != -1) { + switch (opt) { + case 't': + threads = atoi(optarg); + break; + case 's': + seconds = atoi(optarg); + break; + case 'n': + node = atoi(optarg); + break; + case 'b': + b
RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs
> -Original Message- > From: Christoph Hellwig [mailto:h...@lst.de] > Sent: Sunday, November 15, 2020 9:45 PM > To: Song Bao Hua (Barry Song) > Cc: Christoph Hellwig ; iommu@lists.linux-foundation.org; > robin.mur...@arm.com; m.szyprow...@samsung.com; Linuxarm > ; linux-kselft...@vger.kernel.org; xuwei (O) > ; Joerg Roedel ; Will Deacon > ; Shuah Khan > Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for > streaming DMA APIs > > On Sun, Nov 15, 2020 at 12:11:15AM +, Song Bao Hua (Barry Song) > wrote: > > > > Checkpatch has changed 80 to 100. That's probably why my local checkpatch > didn't report any warning: > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id= > bdc48fa11e46f867ea4d > > > > I am happy to change them to be less than 80 if you like. > > Don't rely on checkpatch, it is broken. Look at the coding-style document. > > > > I think this needs to set a dma mask as behavior for unlimited dma > > > mask vs the default 32-bit one can be very different. > > > > I actually prefer users bind real devices with real dma_mask to test rather > than force to change > > the dma_mask in this benchmark. > > The mask is set by the driver, not the device. So you need to set it when > you bind, real device or not. Yes, though it is a little bit tricky. Sometimes, it is done by the "device" in architectures; e.g. there is a lot of dma_mask configuration code in arch/arm/mach-xxx. arch/arm/mach-davinci/da850.c: static u64 da850_vpif_dma_mask = DMA_BIT_MASK(32); static struct platform_device da850_vpif_dev = { .name = "vpif", .id = -1, .dev = { .dma_mask = &da850_vpif_dma_mask, .coherent_dma_mask = DMA_BIT_MASK(32), }, .resource = da850_vpif_resource, .num_resources = ARRAY_SIZE(da850_vpif_resource), }; Sometimes, it is done by "of" or "acpi", for example: drivers/acpi/arm64/iort.c void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size) { u64 end, mask, dmaaddr = 0, size = 0, offset = 0; int ret; ... 
ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size); if (!ret) { /* * Limit coherent and dma mask based on size retrieved from * firmware. */ end = dmaaddr + size - 1; mask = DMA_BIT_MASK(ilog2(end) + 1); dev->bus_dma_limit = end; dev->coherent_dma_mask = mask; *dev->dma_mask = mask; } ... } Sometimes, it is done by the "bus", for example, ISA: isa_dev->dev.coherent_dma_mask = DMA_BIT_MASK(24); isa_dev->dev.dma_mask = &isa_dev->dev.coherent_dma_mask; error = device_register(&isa_dev->dev); if (error) { put_device(&isa_dev->dev); break; } And in many cases, it is done by the driver. On the ARM64 server platform I am testing, drivers actually rarely set dma_mask. So to make the dma benchmark work on all platforms, it seems worth adding a dma_mask_bit parameter. But, in order to avoid breaking the dma_mask of those devices whose dma_mask is set by architectures, acpi and bus, it seems we need to do the below in dma_benchmark (note that dma_set_mask() takes the mask by value): u64 old_mask; old_mask = dma_get_mask(dev); dma_set_mask(dev, new_mask); do_map_benchmark(); /* restore the old dma_mask so that the dma_mask of the device is not changed due to the benchmark when it is bound back to its original driver */ dma_set_mask(dev, old_mask); Thanks Barry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs
> -Original Message- > From: Christoph Hellwig [mailto:h...@lst.de] > Sent: Sunday, November 15, 2020 5:54 AM > To: Song Bao Hua (Barry Song) > Cc: iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com; > m.szyprow...@samsung.com; Linuxarm ; > linux-kselft...@vger.kernel.org; xuwei (O) ; Joerg > Roedel ; Will Deacon ; Shuah Khan > > Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for > streaming DMA APIs > > Lots of > 80 char lines. Please fix up the style. Checkpatch has changed 80 to 100. That's probably why my local checkpatch didn't report any warning: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bdc48fa11e46f867ea4d I am happy to change them to be less than 80 if you like. > > I think this needs to set a dma mask as behavior for unlimited dma > mask vs the default 32-bit one can be very different. I actually prefer that users bind real devices with real dma_masks to test rather than forcing a change of the dma_mask in this benchmark. Some devices might have a 32-bit dma_mask while some others might have an unlimited one. But both of them can bind to this driver or unbind from it after the test is done. So users just need to bind those different real devices with different real dma_masks to dma_benchmark. This can reflect the real performance of the real device better, I think. > I also think you need to be able to pass the direction or have different tests > for directions. bidirectional is not exactly heavily used and pays > more cache management penalty. For this, I'd like to add a direction option in the test app and pass the option to the benchmark driver. Thanks Barry
RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs
> -Original Message- > From: John Garry > Sent: Wednesday, November 11, 2020 10:37 PM > To: Song Bao Hua (Barry Song) ; > iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com; > m.szyprow...@samsung.com > Cc: linux-kselft...@vger.kernel.org; Will Deacon ; Joerg > Roedel ; Linuxarm ; xuwei (O) > ; Shuah Khan > Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for > streaming DMA APIs > > On 11/11/2020 01:29, Song Bao Hua (Barry Song) wrote: > > I'd like to think checking this here would be overdesign. We just give > > users the > > freedom to bind any device they care about to the benchmark driver. Usually > > that means a real hardware either behind an IOMMU or through a direct > > mapping. > > > > if for any reason users put a wrong "device", that is the choice of users. > > Right, but if the device simply has no DMA ops supported, it could be > better to fail the probe rather than let them try the test at all. > > Anyhow, > > the below code will still handle it properly and users will get a report in > > which > > everything is zero. > > > > +static int map_benchmark_thread(void *data) > > +{ > > ... > > + dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, > DMA_BIDIRECTIONAL); > > + if (unlikely(dma_mapping_error(map->dev, dma_addr))) { > > Doing this is proper, but I am not sure if this tells the user the real > problem. Telling users the real problem isn't the design intention of this test benchmark. It is never the purpose of this benchmark. > > > + pr_err("dma_map_single failed on %s\n", > dev_name(map->dev)); > > Not sure why use pr_err() over dev_err(). We are reporting errors in dma-benchmark driver rather than reporting errors in the driver of the specific device. I think we should have "dma-benchmark" as the prefix while printing the name of the device by dev_name(). 
> > + ret = -ENOMEM; > > + goto out; > > + } Thanks, John Thanks Barry
RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs
> -Original Message- > From: John Garry > Sent: Tuesday, November 10, 2020 9:39 PM > To: Song Bao Hua (Barry Song) ; > iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com; > m.szyprow...@samsung.com > Cc: linux-kselft...@vger.kernel.org; Will Deacon ; Joerg > Roedel ; Linuxarm ; xuwei (O) > ; Shuah Khan > Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for > streaming DMA APIs > > On 10/11/2020 08:10, Song Bao Hua (Barry Song) wrote: > > Hello Robin, Christoph, > > Any further comment? John suggested that "depends on DEBUG_FS" should > be added in Kconfig. > > I am collecting more comments to send v4 together with fixing this minor > issue :-) > > > > Thanks > > Barry > > > >> -Original Message- > >> From: Song Bao Hua (Barry Song) > >> Sent: Monday, November 2, 2020 9:07 PM > >> To: iommu@lists.linux-foundation.org; h...@lst.de; > robin.mur...@arm.com; > >> m.szyprow...@samsung.com > >> Cc: Linuxarm ; linux-kselft...@vger.kernel.org; > xuwei > >> (O) ; Song Bao Hua (Barry Song) > >> ; Joerg Roedel ; Will > Deacon > >> ; Shuah Khan > >> Subject: [PATCH v3 1/2] dma-mapping: add benchmark support for > streaming > >> DMA APIs > >> > >> Nowadays, there are increasing requirements to benchmark the > performance > >> of dma_map and dma_unmap particually while the device is attached to an > >> IOMMU. > >> > >> This patch enables the support. Users can run specified number of threads > to > >> do dma_map_page and dma_unmap_page on a specific NUMA node with > the > >> specified duration. Then dma_map_benchmark will calculate the average > >> latency for map and unmap. > >> > >> A difficulity for this benchmark is that dma_map/unmap APIs must run on a > >> particular device. Each device might have different backend of IOMMU or > >> non-IOMMU. 
> >> > >> So we use the driver_override to bind dma_map_benchmark to a particual > >> device by: > >> For platform devices: > >> echo dma_map_benchmark > > /sys/bus/platform/devices/xxx/driver_override > >> echo xxx > /sys/bus/platform/drivers/xxx/unbind > >> echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind > >> > > Hi Barry, > > >> For PCI devices: > >> echo dma_map_benchmark > > >> /sys/bus/pci/devices/:00:01.0/driver_override > >> echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind echo :00:01.0 > > >> /sys/bus/pci/drivers/dma_map_benchmark/bind > > Do we need to check if the device to which we attach actually has DMA > mapping capability? Hello John, I'd like to think checking this here would be overdesign. We just give users the freedom to bind any device they care about to the benchmark driver. Usually that means a real hardware either behind an IOMMU or through a direct mapping. if for any reason users put a wrong "device", that is the choice of users. Anyhow, the below code will still handle it properly and users will get a report in which everything is zero. +static int map_benchmark_thread(void *data) +{ ... + dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, DMA_BIDIRECTIONAL); + if (unlikely(dma_mapping_error(map->dev, dma_addr))) { + pr_err("dma_map_single failed on %s\n", dev_name(map->dev)); + ret = -ENOMEM; + goto out; + } ... +} > > >> > >> Cc: Joerg Roedel > >> Cc: Will Deacon > >> Cc: Shuah Khan > >> Cc: Christoph Hellwig > >> Cc: Marek Szyprowski > >> Cc: Robin Murphy > >> Signed-off-by: Barry Song > >> --- > > Thanks, > John Thanks Barry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs
Hello Robin, Christoph, Any further comment? John suggested that "depends on DEBUG_FS" should be added in Kconfig. I am collecting more comments to send v4 together with fixing this minor issue :-) Thanks Barry > -Original Message- > From: Song Bao Hua (Barry Song) > Sent: Monday, November 2, 2020 9:07 PM > To: iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com; > m.szyprow...@samsung.com > Cc: Linuxarm ; linux-kselft...@vger.kernel.org; xuwei > (O) ; Song Bao Hua (Barry Song) > ; Joerg Roedel ; Will Deacon > ; Shuah Khan > Subject: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming > DMA APIs > > Nowadays, there are increasing requirements to benchmark the performance > of dma_map and dma_unmap particually while the device is attached to an > IOMMU. > > This patch enables the support. Users can run specified number of threads to > do dma_map_page and dma_unmap_page on a specific NUMA node with the > specified duration. Then dma_map_benchmark will calculate the average > latency for map and unmap. > > A difficulity for this benchmark is that dma_map/unmap APIs must run on a > particular device. Each device might have different backend of IOMMU or > non-IOMMU. 
> > So we use the driver_override to bind dma_map_benchmark to a particual > device by: > For platform devices: > echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override > echo xxx > /sys/bus/platform/drivers/xxx/unbind > echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind > > For PCI devices: > echo dma_map_benchmark > > /sys/bus/pci/devices/:00:01.0/driver_override > echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind echo :00:01.0 > > /sys/bus/pci/drivers/dma_map_benchmark/bind > > Cc: Joerg Roedel > Cc: Will Deacon > Cc: Shuah Khan > Cc: Christoph Hellwig > Cc: Marek Szyprowski > Cc: Robin Murphy > Signed-off-by: Barry Song > --- > -v3: > * fix build issues reported by 0day kernel test robot > -v2: > * add PCI support; v1 supported platform devices only > * replace ssleep by msleep_interruptible() to permit users to exit > benchmark before it is completed > * many changes according to Robin's suggestions, thanks! Robin > - add standard deviation output to reflect the worst case > - check users' parameters strictly like the number of threads > - make cache dirty before dma_map > - fix unpaired dma_map_page and dma_unmap_single; > - remove redundant "long long" before ktime_to_ns(); > - use devm_add_action() > > kernel/dma/Kconfig | 8 + > kernel/dma/Makefile| 1 + > kernel/dma/map_benchmark.c | 296 > + > 3 files changed, 305 insertions(+) > create mode 100644 kernel/dma/map_benchmark.c > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index > c99de4a21458..949c53da5991 100644 > --- a/kernel/dma/Kconfig > +++ b/kernel/dma/Kconfig > @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG > is technically out-of-spec. > > If unsure, say N. > + > +config DMA_MAP_BENCHMARK > + bool "Enable benchmarking of streaming DMA mapping" > + help > + Provides /sys/kernel/debug/dma_map_benchmark that helps with > testing > + performance of dma_(un)map_page. 
> + > + See tools/testing/selftests/dma/dma_map_benchmark.c > diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile index > dc755ab68aab..7aa6b26b1348 100644 > --- a/kernel/dma/Makefile > +++ b/kernel/dma/Makefile > @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG) += debug.o > obj-$(CONFIG_SWIOTLB)+= swiotlb.o > obj-$(CONFIG_DMA_COHERENT_POOL) += pool.o > obj-$(CONFIG_DMA_REMAP) += remap.o > +obj-$(CONFIG_DMA_MAP_BENCHMARK) += map_benchmark.o > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c > new file mode 100644 index ..dc4e5ff48a2d > --- /dev/null > +++ b/kernel/dma/map_benchmark.c > @@ -0,0 +1,296 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * Copyright (C) 2020 Hisilicon Limited. > + */ > + > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define DMA_MAP_BENCHMARK_IOWR('d', 1, struct map_benchmark) > +#define DMA_MAP_MAX_THREADS 1024 > +#define DMA_MAP_MAX_SECONDS 300 > + > +struct map_benchmark { > + __u64 avg_map_100ns; /* average map l
RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs
> -Original Message- > From: John Garry > Sent: Monday, November 2, 2020 10:19 PM > To: Song Bao Hua (Barry Song) ; > iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com; > m.szyprow...@samsung.com > Cc: linux-kselft...@vger.kernel.org; Shuah Khan ; Joerg > Roedel ; Linuxarm ; xuwei (O) > ; Will Deacon > Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for > streaming DMA APIs > > On 02/11/2020 08:06, Barry Song wrote: > > Nowadays, there are increasing requirements to benchmark the performance > > of dma_map and dma_unmap particually while the device is attached to an > > IOMMU. > > > > This patch enables the support. Users can run specified number of threads > > to do dma_map_page and dma_unmap_page on a specific NUMA node with > the > > specified duration. Then dma_map_benchmark will calculate the average > > latency for map and unmap. > > > > A difficulity for this benchmark is that dma_map/unmap APIs must run on > > a particular device. Each device might have different backend of IOMMU or > > non-IOMMU. 
> > > > So we use the driver_override to bind dma_map_benchmark to a particual > > device by: > > For platform devices: > > echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override > > echo xxx > /sys/bus/platform/drivers/xxx/unbind > > echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind > > > > For PCI devices: > > echo dma_map_benchmark > > /sys/bus/pci/devices/:00:01.0/driver_override > > echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind > > echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind > > > > Cc: Joerg Roedel > > Cc: Will Deacon > > Cc: Shuah Khan > > Cc: Christoph Hellwig > > Cc: Marek Szyprowski > > Cc: Robin Murphy > > Signed-off-by: Barry Song > > --- > > -v3: > >* fix build issues reported by 0day kernel test robot > > -v2: > >* add PCI support; v1 supported platform devices only > >* replace ssleep by msleep_interruptible() to permit users to exit > > benchmark before it is completed > >* many changes according to Robin's suggestions, thanks! Robin > > - add standard deviation output to reflect the worst case > > - check users' parameters strictly like the number of threads > > - make cache dirty before dma_map > > - fix unpaired dma_map_page and dma_unmap_single; > > - remove redundant "long long" before ktime_to_ns(); > > - use devm_add_action() > > > > kernel/dma/Kconfig | 8 + > > kernel/dma/Makefile| 1 + > > kernel/dma/map_benchmark.c | 296 > + > > 3 files changed, 305 insertions(+) > > create mode 100644 kernel/dma/map_benchmark.c > > > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig > > index c99de4a21458..949c53da5991 100644 > > --- a/kernel/dma/Kconfig > > +++ b/kernel/dma/Kconfig > > @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG > > is technically out-of-spec. > > > > If unsure, say N. > > + > > +config DMA_MAP_BENCHMARK > > + bool "Enable benchmarking of streaming DMA mapping" > > + help > > + Provides /sys/kernel/debug/dma_map_benchmark that helps with > testing > > + performance of dma_(un)map_page. 
> Since this is a driver, any reason for which it cannot be loadable? If > so, it seems any functionality would depend on DEBUG FS, I figure that's > just how we work for debugfs. We depend on kthread_bind_mask(), which isn't an exported symbol. Maybe it is worth sending a patch to export it? > > Thanks, > John > > > + > > + See tools/testing/selftests/dma/dma_map_benchmark.c > diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile > > index dc755ab68aab..7aa6b26b1348 100644 > > --- a/kernel/dma/Makefile > > +++ b/kernel/dma/Makefile Thanks Barry
[PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs
Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap particually while the device is attached to an IOMMU. This patch enables the support. Users can run specified number of threads to do dma_map_page and dma_unmap_page on a specific NUMA node with the specified duration. Then dma_map_benchmark will calculate the average latency for map and unmap. A difficulity for this benchmark is that dma_map/unmap APIs must run on a particular device. Each device might have different backend of IOMMU or non-IOMMU. So we use the driver_override to bind dma_map_benchmark to a particual device by: For platform devices: echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override echo xxx > /sys/bus/platform/drivers/xxx/unbind echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind For PCI devices: echo dma_map_benchmark > /sys/bus/pci/devices/:00:01.0/driver_override echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind Cc: Joerg Roedel Cc: Will Deacon Cc: Shuah Khan Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Robin Murphy Signed-off-by: Barry Song --- -v3: * fix build issues reported by 0day kernel test robot -v2: * add PCI support; v1 supported platform devices only * replace ssleep by msleep_interruptible() to permit users to exit benchmark before it is completed * many changes according to Robin's suggestions, thanks! 
Robin - add standard deviation output to reflect the worst case - check users' parameters strictly like the number of threads - make cache dirty before dma_map - fix unpaired dma_map_page and dma_unmap_single; - remove redundant "long long" before ktime_to_ns(); - use devm_add_action() kernel/dma/Kconfig | 8 + kernel/dma/Makefile| 1 + kernel/dma/map_benchmark.c | 296 + 3 files changed, 305 insertions(+) create mode 100644 kernel/dma/map_benchmark.c diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index c99de4a21458..949c53da5991 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG is technically out-of-spec. If unsure, say N. + +config DMA_MAP_BENCHMARK + bool "Enable benchmarking of streaming DMA mapping" + help + Provides /sys/kernel/debug/dma_map_benchmark that helps with testing + performance of dma_(un)map_page. + + See tools/testing/selftests/dma/dma_map_benchmark.c diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile index dc755ab68aab..7aa6b26b1348 100644 --- a/kernel/dma/Makefile +++ b/kernel/dma/Makefile @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG) += debug.o obj-$(CONFIG_SWIOTLB) += swiotlb.o obj-$(CONFIG_DMA_COHERENT_POOL)+= pool.o obj-$(CONFIG_DMA_REMAP)+= remap.o +obj-$(CONFIG_DMA_MAP_BENCHMARK)+= map_benchmark.o diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c new file mode 100644 index ..dc4e5ff48a2d --- /dev/null +++ b/kernel/dma/map_benchmark.c @@ -0,0 +1,296 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2020 Hisilicon Limited. 
+ */ + +#define pr_fmt(fmt)KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark) +#define DMA_MAP_MAX_THREADS1024 +#define DMA_MAP_MAX_SECONDS300 + +struct map_benchmark { + __u64 avg_map_100ns; /* average map latency in 100ns */ + __u64 map_stddev; /* standard deviation of map latency */ + __u64 avg_unmap_100ns; /* as above */ + __u64 unmap_stddev; + __u32 threads; /* how many threads will do map/unmap in parallel */ + __u32 seconds; /* how long the test will last */ + int node; /* which numa node this benchmark will run on */ + __u64 expansion[10];/* For future use */ +}; + +struct map_benchmark_data { + struct map_benchmark bparam; + struct device *dev; + struct dentry *debugfs; + atomic64_t sum_map_100ns; + atomic64_t sum_unmap_100ns; + atomic64_t sum_square_map; + atomic64_t sum_square_unmap; + atomic64_t loops; +}; + +static int map_benchmark_thread(void *data) +{ + void *buf; + dma_addr_t dma_addr; + struct map_benchmark_data *map = data; + int ret = 0; + + buf = (void *)__get_free_page(GFP_KERNEL); + if (!buf) + return -ENOMEM; + + while (!kthread_should_stop()) { + __u64 map_100ns, unmap_100ns, map_square, unmap_square; + ktime_t map_stime, map_etime, unmap_stime, unmap_etime; + + /* +* for a non-coherent d
[PATCH v3 0/2] dma-mapping: provide a benchmark for streaming DMA mapping
Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap, particularly while the device is attached to an IOMMU. This patchset provides the benchmark infrastructure for streaming DMA mapping. The architecture of the code is quite similar to the GUP benchmark: * mm/gup_benchmark.c provides the kernel interface; * tools/testing/selftests/vm/gup_benchmark.c provides the user program to call the interface provided by mm/gup_benchmark.c. In our case, kernel/dma/map_benchmark.c is like mm/gup_benchmark.c; tools/testing/selftests/dma/dma_map_benchmark.c is like tools/testing/selftests/vm/gup_benchmark.c. A major difference from the GUP benchmark is that the DMA_MAP benchmark needs to run on a device. Consider one board with the devices and IOMMUs below: device A --- IOMMU 1 device B --- IOMMU 2 device C --- non-IOMMU Different devices might attach to different IOMMUs, or to no IOMMU. To make the benchmark run, we can either * create a virtual device and hack the kernel code to attach the virtual device to IOMMU1, IOMMU2 or non-IOMMU, or * use the existing driver_override mechanism: unbind device A, B, or C from its original driver and bind it to the dma_map_benchmark platform driver or pci driver for benchmarking. In this patchset, I prefer to use driver_override and avoid the ugly hack in the kernel. We can dynamically switch devices behind different IOMMUs to get the performance with an IOMMU or without. -v3: * fix build issues reported by 0day kernel test robot -v2: * add PCI support; v1 supported platform devices only * replace ssleep by msleep_interruptible() to permit users to exit benchmark before it is completed * many changes according to Robin's suggestions, thanks! 
Robin - add standard deviation output to reflect the worst case - check users' parameters strictly like the number of threads - make cache dirty before dma_map - fix unpaired dma_map_page and dma_unmap_single; - remove redundant "long long" before ktime_to_ns(); - use devm_add_action() Barry Song (2): dma-mapping: add benchmark support for streaming DMA APIs selftests/dma: add test application for DMA_MAP_BENCHMARK MAINTAINERS | 6 + kernel/dma/Kconfig| 8 + kernel/dma/Makefile | 1 + kernel/dma/map_benchmark.c| 296 ++ tools/testing/selftests/dma/Makefile | 6 + tools/testing/selftests/dma/config| 1 + .../testing/selftests/dma/dma_map_benchmark.c | 87 + 7 files changed, 405 insertions(+) create mode 100644 kernel/dma/map_benchmark.c create mode 100644 tools/testing/selftests/dma/Makefile create mode 100644 tools/testing/selftests/dma/config create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c -- 2.25.1
[PATCH v3 2/2] selftests/dma: add test application for DMA_MAP_BENCHMARK
This patch provides the test application for DMA_MAP_BENCHMARK. Before running the test application, we need to bind a device to dma_map_ benchmark driver. For example, unbind "xxx" from its original driver and bind to dma_map_benchmark: echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override echo xxx > /sys/bus/platform/drivers/xxx/unbind echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind Another example for PCI devices: echo dma_map_benchmark > /sys/bus/pci/devices/:00:01.0/driver_override echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind The below command will run 16 threads on numa node 0 for 10 seconds on the device bound to dma_map_benchmark platform_driver or pci_driver: ./dma_map_benchmark -t 16 -s 10 -n 0 dma mapping benchmark: threads:16 seconds:10 average map latency(us):1.1 standard deviation:1.9 average unmap latency(us):0.5 standard deviation:0.8 Cc: Joerg Roedel Cc: Will Deacon Cc: Shuah Khan Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Robin Murphy Signed-off-by: Barry Song --- MAINTAINERS | 6 ++ tools/testing/selftests/dma/Makefile | 6 ++ tools/testing/selftests/dma/config| 1 + .../testing/selftests/dma/dma_map_benchmark.c | 87 +++ 4 files changed, 100 insertions(+) create mode 100644 tools/testing/selftests/dma/Makefile create mode 100644 tools/testing/selftests/dma/config create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c diff --git a/MAINTAINERS b/MAINTAINERS index 608fc8484c02..a1e38d5e14f6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5247,6 +5247,12 @@ F: include/linux/dma-mapping.h F: include/linux/dma-map-ops.h F: kernel/dma/ +DMA MAPPING BENCHMARK +M: Barry Song +L: iommu@lists.linux-foundation.org +F: kernel/dma/map_benchmark.c +F: tools/testing/selftests/dma/ + DMA-BUF HEAPS FRAMEWORK M: Sumit Semwal R: Benjamin Gaignard diff --git a/tools/testing/selftests/dma/Makefile b/tools/testing/selftests/dma/Makefile new file mode 100644 index 
[PATCH v2 1/2] dma-mapping: add benchmark support for streaming DMA APIs
Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap particularly while the device is attached to an IOMMU. This patch enables the support. Users can run a specified number of threads to do dma_map_page and dma_unmap_page on a specific NUMA node with the specified duration. Then dma_map_benchmark will calculate the average latency for map and unmap. A difficulty for this benchmark is that dma_map/unmap APIs must run on a particular device. Each device might have a different backend of IOMMU or non-IOMMU. So we use the driver_override to bind dma_map_benchmark to a particular device by: For platform devices: echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override echo xxx > /sys/bus/platform/drivers/xxx/unbind echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind For PCI devices: echo dma_map_benchmark > /sys/bus/pci/devices/:00:01.0/driver_override echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind Cc: Joerg Roedel Cc: Will Deacon Cc: Shuah Khan Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Robin Murphy Signed-off-by: Barry Song --- -v2: * add PCI support; v1 supported platform devices only * replace ssleep by msleep_interruptible() to permit users to exit benchmark before it is completed * many changes according to Robin's suggestions, thanks! 
Robin - add standard deviation output to reflect the worst case - check users' parameters strictly like the number of threads - make cache dirty before dma_map - fix unpaired dma_map_page and dma_unmap_single; - remove redundant "long long" before ktime_to_ns(); - use devm_add_action(); - wakeup all threads together after they are ready kernel/dma/Kconfig | 8 + kernel/dma/Makefile| 1 + kernel/dma/map_benchmark.c | 295 + 3 files changed, 304 insertions(+) create mode 100644 kernel/dma/map_benchmark.c diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index c99de4a21458..949c53da5991 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG is technically out-of-spec. If unsure, say N. + +config DMA_MAP_BENCHMARK + bool "Enable benchmarking of streaming DMA mapping" + help + Provides /sys/kernel/debug/dma_map_benchmark that helps with testing + performance of dma_(un)map_page. + + See tools/testing/selftests/dma/dma_map_benchmark.c diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile index dc755ab68aab..7aa6b26b1348 100644 --- a/kernel/dma/Makefile +++ b/kernel/dma/Makefile @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG) += debug.o obj-$(CONFIG_SWIOTLB) += swiotlb.o obj-$(CONFIG_DMA_COHERENT_POOL)+= pool.o obj-$(CONFIG_DMA_REMAP)+= remap.o +obj-$(CONFIG_DMA_MAP_BENCHMARK)+= map_benchmark.o diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c new file mode 100644 index ..ac397758087b --- /dev/null +++ b/kernel/dma/map_benchmark.c @@ -0,0 +1,295 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2020 Hisilicon Limited. 
+ */ + +#define pr_fmt(fmt)KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark) +#define DMA_MAP_MAX_THREADS1024 +#define DMA_MAP_MAX_SECONDS300 + +struct map_benchmark { + __u64 avg_map_100ns; /* average map latency in 100ns */ + __u64 map_stddev; /* standard deviation of map latency */ + __u64 avg_unmap_100ns; /* as above */ + __u64 unmap_stddev; + __u32 threads; /* how many threads will do map/unmap in parallel */ + __u32 seconds; /* how long the test will last */ + int node; /* which numa node this benchmark will run on */ + __u64 expansion[10];/* For future use */ +}; + +struct map_benchmark_data { + struct map_benchmark bparam; + struct device *dev; + struct dentry *debugfs; + atomic64_t sum_map_100ns; + atomic64_t sum_unmap_100ns; + atomic64_t sum_square_map; + atomic64_t sum_square_unmap; + atomic64_t loops; +}; + +static int map_benchmark_thread(void *data) +{ + void *buf; + dma_addr_t dma_addr; + struct map_benchmark_data *map = data; + int ret = 0; + + buf = (void *)__get_free_page(GFP_KERNEL); + if (!buf) + return -ENOMEM; + + while (!kthread_should_stop()) { + __u64 map_100ns, unmap_100ns, map_square, unmap_square; + ktime_t map_stime, map_etime, unmap_stime, unmap_etime; + + /* +* for a non-coherent device, if we
[PATCH v2 0/2] dma-mapping: provide a benchmark for streaming DMA mapping
Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap particularly while the device is attached to an IOMMU. This patchset provides the benchmark infrastructure for streaming DMA mapping. The architecture of the code is pretty similar to the GUP benchmark: * mm/gup_benchmark.c provides kernel interface; * tools/testing/selftests/vm/gup_benchmark.c provides user program to call the interface provided by mm/gup_benchmark.c. In our case, kernel/dma/map_benchmark.c is like mm/gup_benchmark.c; tools/testing/selftests/dma/dma_map_benchmark.c is like tools/testing/ selftests/vm/gup_benchmark.c A major difference with GUP benchmark is DMA_MAP benchmark needs to run on a device. Considering one board with below devices and IOMMUs device A --- IOMMU 1 device B --- IOMMU 2 device C --- non-IOMMU Different devices might attach to different IOMMU or non-IOMMU. To make benchmark run, we can either * create a virtual device and hack the kernel code to attach the virtual device to IOMMU1, IOMMU2 or non-IOMMU. * use the existing driver_override mechanism, unbind device A, B, or C from their original driver and bind A to dma_map_benchmark platform driver or pci driver for benchmarking. In this patchset, I prefer to use the driver_override and avoid the ugly hack in the kernel. We can dynamically switch devices behind different IOMMUs to get the performance of IOMMU or non-IOMMU. -v2: * add PCI support; v1 supported platform devices only * replace ssleep by msleep_interruptible() to permit users to exit benchmark before it is completed * many changes according to Robin's suggestions, thanks! 
Robin - add standard deviation output to reflect the worst case - check users' parameters strictly like the number of threads - make cache dirty before dma_map - fix unpaired dma_map_page and dma_unmap_single; - remove redundant "long long" before ktime_to_ns(); - use devm_add_action(); - wakeup all threads together after they are ready Barry Song (2): dma-mapping: add benchmark support for streaming DMA APIs selftests/dma: add test application for DMA_MAP_BENCHMARK MAINTAINERS | 6 + kernel/dma/Kconfig| 8 + kernel/dma/Makefile | 1 + kernel/dma/map_benchmark.c| 295 ++ tools/testing/selftests/dma/Makefile | 6 + tools/testing/selftests/dma/config| 1 + .../testing/selftests/dma/dma_map_benchmark.c | 87 ++ 7 files changed, 404 insertions(+) create mode 100644 kernel/dma/map_benchmark.c create mode 100644 tools/testing/selftests/dma/Makefile create mode 100644 tools/testing/selftests/dma/config create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 2/2] selftests/dma: add test application for DMA_MAP_BENCHMARK
This patch provides the test application for DMA_MAP_BENCHMARK. Before running the test application, we need to bind a device to dma_map_ benchmark driver. For example, unbind "xxx" from its original driver and bind to dma_map_benchmark: echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override echo xxx > /sys/bus/platform/drivers/xxx/unbind echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind Another example for PCI devices: echo dma_map_benchmark > /sys/bus/pci/devices/:00:01.0/driver_override echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind The below command will run 16 threads on numa node 0 for 10 seconds on the device bound to dma_map_benchmark platform_driver or pci_driver: ./dma_map_benchmark -t 16 -s 10 -n 0 dma mapping benchmark: threads:16 seconds:10 average map latency(us):1.1 standard deviation:1.9 average unmap latency(us):0.5 standard deviation:0.8 Cc: Joerg Roedel Cc: Will Deacon Cc: Shuah Khan Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Robin Murphy Signed-off-by: Barry Song --- -v2: * check parameters like threads, seconds strictly * print standard deviation for latencies MAINTAINERS | 6 ++ tools/testing/selftests/dma/Makefile | 6 ++ tools/testing/selftests/dma/config| 1 + .../testing/selftests/dma/dma_map_benchmark.c | 87 +++ 4 files changed, 100 insertions(+) create mode 100644 tools/testing/selftests/dma/Makefile create mode 100644 tools/testing/selftests/dma/config create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c diff --git a/MAINTAINERS b/MAINTAINERS index 608fc8484c02..a1e38d5e14f6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5247,6 +5247,12 @@ F: include/linux/dma-mapping.h F: include/linux/dma-map-ops.h F: kernel/dma/ +DMA MAPPING BENCHMARK +M: Barry Song +L: iommu@lists.linux-foundation.org +F: kernel/dma/map_benchmark.c +F: tools/testing/selftests/dma/ + DMA-BUF HEAPS FRAMEWORK M: Sumit Semwal R: Benjamin Gaignard diff --git 
a/tools/testing/selftests/dma/Makefile b/tools/testing/selftests/dma/Makefile new file mode 100644 index ..aa8e8b5b3864 --- /dev/null +++ b/tools/testing/selftests/dma/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 +CFLAGS += -I../../../../usr/include/ + +TEST_GEN_PROGS := dma_map_benchmark + +include ../lib.mk diff --git a/tools/testing/selftests/dma/config b/tools/testing/selftests/dma/config new file mode 100644 index ..6102ee3c43cd --- /dev/null +++ b/tools/testing/selftests/dma/config @@ -0,0 +1 @@ +CONFIG_DMA_MAP_BENCHMARK=y diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c new file mode 100644 index ..4778df0c458f --- /dev/null +++ b/tools/testing/selftests/dma/dma_map_benchmark.c @@ -0,0 +1,87 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2020 Hisilicon Limited. + */ + +#include +#include +#include +#include +#include +#include +#include + +#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark) +#define DMA_MAP_MAX_THREADS1024 +#define DMA_MAP_MAX_SECONDS 300 + +struct map_benchmark { + __u64 avg_map_100ns; /* average map latency in 100ns */ + __u64 map_stddev; /* standard deviation of map latency */ + __u64 avg_unmap_100ns; /* as above */ + __u64 unmap_stddev; + __u32 threads; /* how many threads will do map/unmap in parallel */ + __u32 seconds; /* how long the test will last */ + int node; /* which numa node this benchmark will run on */ + __u64 expansion[10];/* For future use */ +}; + +int main(int argc, char **argv) +{ + struct map_benchmark map; + int fd, opt; + /* default single thread, run 20 seconds on NUMA_NO_NODE */ + int threads = 1, seconds = 20, node = -1; + int cmd = DMA_MAP_BENCHMARK; + char *p; + + while ((opt = getopt(argc, argv, "t:s:n:")) != -1) { + switch (opt) { + case 't': + threads = atoi(optarg); + break; + case 's': + seconds = atoi(optarg); + break; + case 'n': + node = atoi(optarg); + break; + default: + return -1; + } + } + + if 
(threads <= 0 || threads > DMA_MAP_MAX_THREADS) { + fprintf(stderr, "invalid number of threads, must be in 1-%d\n", + DMA_MAP_MAX_THREADS); + exit(1); + } + + if (seconds <= 0 || seconds > DMA_MAP_MAX_SECONDS) { + fprintf(stderr, "invalid number of seconds, must be
RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs
> -Original Message- > From: Song Bao Hua (Barry Song) [mailto:song.bao@hisilicon.com] > Sent: Saturday, October 31, 2020 10:45 PM > To: Robin Murphy ; > iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com > Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm > ; linux-kselft...@vger.kernel.org > Subject: RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming > DMA APIs > > > > > -Original Message- > > From: Robin Murphy [mailto:robin.mur...@arm.com] > > Sent: Saturday, October 31, 2020 4:48 AM > > To: Song Bao Hua (Barry Song) ; > > iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com > > Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm > > ; linux-kselft...@vger.kernel.org > > Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for > streaming > > DMA APIs > > > > On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote: > > [...] > > >>> +struct map_benchmark { > > >>> + __u64 map_nsec; > > >>> + __u64 unmap_nsec; > > >>> + __u32 threads; /* how many threads will do map/unmap in parallel > > */ > > >>> + __u32 seconds; /* how long the test will last */ > > >>> + int node; /* which numa node this benchmark will run on */ > > >>> + __u64 expansion[10];/* For future use */ > > >>> +}; > > >> > > >> I'm no expert on userspace ABIs (and what little experience I do have > > >> is mostly of Win32...), so hopefully someone else will comment if > > >> there's anything of concern here. One thing I wonder is that there's > > >> a fair likelihood of functionality evolving here over time, so might > > >> it be appropriate to have some sort of explicit versioning parameter > > >> for robustness? > > > > > > I copied that from gup_benchmark. There is no this kind of code to > > > compare version. > > > I believe there is a likelihood that kernel module is changed but > > > users are still using old userspace tool, this might lead to the > > > incompatible data structure. 
> > > But not sure if it is a big problem :-) > > > > Yeah, like I say I don't really have a good feeling for what would be best > > here, > > I'm just thinking of what I do know and wary of the potential for a "640 > > bits > > ought to be enough for anyone" issue ;) > > > > >>> +struct map_benchmark_data { > > >>> + struct map_benchmark bparam; > > >>> + struct device *dev; > > >>> + struct dentry *debugfs; > > >>> + atomic64_t total_map_nsecs; > > >>> + atomic64_t total_map_loops; > > >>> + atomic64_t total_unmap_nsecs; > > >>> + atomic64_t total_unmap_loops; > > >>> +}; > > >>> + > > >>> +static int map_benchmark_thread(void *data) { > > >>> + struct page *page; > > >>> + dma_addr_t dma_addr; > > >>> + struct map_benchmark_data *map = data; > > >>> + int ret = 0; > > >>> + > > >>> + page = alloc_page(GFP_KERNEL); > > >>> + if (!page) > > >>> + return -ENOMEM; > > >>> + > > >>> + while (!kthread_should_stop()) { > > >>> + ktime_t map_stime, map_etime, unmap_stime, unmap_etime; > > >>> + > > >>> + map_stime = ktime_get(); > > >>> + dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE, > > >> DMA_BIDIRECTIONAL); > > >> > > >> Note that for a non-coherent device, this will give an underestimate > > >> of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings, > > >> since the page will never be dirty in the cache (except possibly the > > >> very first time through). > > > > > > Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1" > > > after we have this basic framework. > > > > That wasn't so much about the direction itself, just that if it's anything > > other > > than FROM_DEVICE, we should probably do something to dirty the buffer by > a > > reasonable amount before each map. Otherwise the measured performance > is > > going to be unreal
RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs
> -Original Message- > From: Robin Murphy [mailto:robin.mur...@arm.com] > Sent: Saturday, October 31, 2020 4:48 AM > To: Song Bao Hua (Barry Song) ; > iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com > Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm > ; linux-kselft...@vger.kernel.org > Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming > DMA APIs > > On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote: > [...] > >>> +struct map_benchmark { > >>> + __u64 map_nsec; > >>> + __u64 unmap_nsec; > >>> + __u32 threads; /* how many threads will do map/unmap in parallel > */ > >>> + __u32 seconds; /* how long the test will last */ > >>> + int node; /* which numa node this benchmark will run on */ > >>> + __u64 expansion[10];/* For future use */ > >>> +}; > >> > >> I'm no expert on userspace ABIs (and what little experience I do have > >> is mostly of Win32...), so hopefully someone else will comment if > >> there's anything of concern here. One thing I wonder is that there's > >> a fair likelihood of functionality evolving here over time, so might > >> it be appropriate to have some sort of explicit versioning parameter > >> for robustness? > > > > I copied that from gup_benchmark. There is no this kind of code to > > compare version. > > I believe there is a likelihood that kernel module is changed but > > users are still using old userspace tool, this might lead to the > > incompatible data structure. 
> > But not sure if it is a big problem :-) > > Yeah, like I say I don't really have a good feeling for what would be best > here, > I'm just thinking of what I do know and wary of the potential for a "640 bits > ought to be enough for anyone" issue ;) > > >>> +struct map_benchmark_data { > >>> + struct map_benchmark bparam; > >>> + struct device *dev; > >>> + struct dentry *debugfs; > >>> + atomic64_t total_map_nsecs; > >>> + atomic64_t total_map_loops; > >>> + atomic64_t total_unmap_nsecs; > >>> + atomic64_t total_unmap_loops; > >>> +}; > >>> + > >>> +static int map_benchmark_thread(void *data) { > >>> + struct page *page; > >>> + dma_addr_t dma_addr; > >>> + struct map_benchmark_data *map = data; > >>> + int ret = 0; > >>> + > >>> + page = alloc_page(GFP_KERNEL); > >>> + if (!page) > >>> + return -ENOMEM; > >>> + > >>> + while (!kthread_should_stop()) { > >>> + ktime_t map_stime, map_etime, unmap_stime, unmap_etime; > >>> + > >>> + map_stime = ktime_get(); > >>> + dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE, > >> DMA_BIDIRECTIONAL); > >> > >> Note that for a non-coherent device, this will give an underestimate > >> of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings, > >> since the page will never be dirty in the cache (except possibly the > >> very first time through). > > > > Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1" > > after we have this basic framework. > > That wasn't so much about the direction itself, just that if it's anything > other > than FROM_DEVICE, we should probably do something to dirty the buffer by a > reasonable amount before each map. Otherwise the measured performance is > going to be unrealistic on many systems. Maybe put a memset(buf, 0, PAGE_SIZE) before dma_map will help ? > > [...] 
> >>> + atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime, > >> unmap_stime)), > >>> + &map->total_unmap_nsecs); > >>> + atomic64_inc(&map->total_map_loops); > >>> + atomic64_inc(&map->total_unmap_loops); > >> > >> I think it would be worth keeping track of the variances as well - it > >> can be hard to tell if a reasonable-looking average is hiding > >> terrible worst-case behaviour. > > > > This is a sensible requirement. I believe it is better to be handled > > by the existing kernel tracing method. > > > > Maybe we need a histogram like: > > Delay sample count > > 1-2us 1000 *** > > 2-3us 2000 *** > > 3-4us 100 * > > . > > This will be mo
RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs
> -Original Message- > From: Robin Murphy [mailto:robin.mur...@arm.com] > Sent: Friday, October 30, 2020 8:38 AM > To: Song Bao Hua (Barry Song) ; > iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com > Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm > ; linux-kselft...@vger.kernel.org > Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming > DMA APIs > > On 2020-10-27 03:53, Barry Song wrote: > > Nowadays, there are increasing requirements to benchmark the performance > > of dma_map and dma_unmap particually while the device is attached to an > > IOMMU. > > > > This patch enables the support. Users can run specified number of threads > > to do dma_map_page and dma_unmap_page on a specific NUMA node with > the > > specified duration. Then dma_map_benchmark will calculate the average > > latency for map and unmap. > > > > A difficulity for this benchmark is that dma_map/unmap APIs must run on > > a particular device. Each device might have different backend of IOMMU or > > non-IOMMU. > > > > So we use the driver_override to bind dma_map_benchmark to a particual > > device by: > > echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override > > echo xxx > /sys/bus/platform/drivers/xxx/unbind > > echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind > > > > For this moment, it supports platform device only, PCI device will also > > be supported afterwards. > > Neat! This is something I've thought about many times, but never got > round to attempting :) I am happy you have the same needs. When I came to IOMMU area a half year ago, the first thing I've done was writing a rough benchmark. At that time, I hacked kernel to get a device behind an IOMMU. Recently, I got some time to think about how to get "device" without ugly hacking and then clean up code for sending patches out to provide a common benchmark in order that everybody can use. 
> > I think the basic latency measurement for mapping and unmapping pages is > enough to start with, but there are definitely some more things that > would be interesting to look into for future enhancements: > > - a choice of mapping sizes, both smaller and larger than one page, to > help characterise stuff like cache maintenance overhead and bounce > buffer/IOVA fragmentation. > - alternative allocation patterns like doing lots of maps first, then > all their corresponding unmaps (to provoke things like the worst-case > IOVA rcache behaviour). > - ways to exercise a range of those parameters at once across > different threads in a single test. > Yes, sure. Once we have a basic framework, we can add more benchmark patterns by using different parameters in the userspace tool: testing/selftests/dma/dma_map_benchmark.c Similar function extensions have been carried out in GUP_BENCHMARK. > But let's get a basic framework nailed down first... Sure. > > > Cc: Joerg Roedel > > Cc: Will Deacon > > Cc: Shuah Khan > > Cc: Christoph Hellwig > > Cc: Marek Szyprowski > > Cc: Robin Murphy > > Signed-off-by: Barry Song > > --- > > kernel/dma/Kconfig | 8 ++ > > kernel/dma/Makefile| 1 + > > kernel/dma/map_benchmark.c | 202 > + > > 3 files changed, 211 insertions(+) > > create mode 100644 kernel/dma/map_benchmark.c > > > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig > > index c99de4a21458..949c53da5991 100644 > > --- a/kernel/dma/Kconfig > > +++ b/kernel/dma/Kconfig > > @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG > > is technically out-of-spec. > > > > If unsure, say N. > > + > > +config DMA_MAP_BENCHMARK > > + bool "Enable benchmarking of streaming DMA mapping" > > + help > > + Provides /sys/kernel/debug/dma_map_benchmark that helps with > testing > > + performance of dma_(un)map_page. 
> > + > > + See tools/testing/selftests/dma/dma_map_benchmark.c > > diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile > > index dc755ab68aab..7aa6b26b1348 100644 > > --- a/kernel/dma/Makefile > > +++ b/kernel/dma/Makefile > > @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG) += debug.o > > obj-$(CONFIG_SWIOTLB) += swiotlb.o > > obj-$(CONFIG_DMA_COHERENT_POOL) += pool.o > > obj-$(CONFIG_DMA_REMAP) += remap.o > > +obj-$(CONFIG_DMA_MAP_BENCHMARK)+= map_benchmark.o > > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/m
RE: [PATCH] dma: Per-NUMA-node CMA should depend on NUMA
> -Original Message- > From: h...@lst.de [mailto:h...@lst.de] > Sent: Tuesday, October 27, 2020 8:55 PM > To: Song Bao Hua (Barry Song) > Cc: Robin Murphy ; h...@lst.de; > iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org > Subject: Re: [PATCH] dma: Per-NUMA-node CMA should depend on NUMA > > On Mon, Oct 26, 2020 at 08:07:43PM +0000, Song Bao Hua (Barry Song) > wrote: > > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig > > > index c99de4a21458..964b74c9b7e3 100644 > > > --- a/kernel/dma/Kconfig > > > +++ b/kernel/dma/Kconfig > > > @@ -125,7 +125,8 @@ if DMA_CMA > > > > > > config DMA_PERNUMA_CMA > > > bool "Enable separate DMA Contiguous Memory Area for each NUMA > > > Node" > > > - default NUMA && ARM64 > > > + depends on NUMA > > > + default ARM64 > > > > On the other hand, at this moment, only ARM64 is calling the init code > > to get per_numa cma. Do we need to > > depends on NUMA && ARM64 ? > > so that this is not enabled by non-arm64? > > I actually hate having arch symbols in common code. A new > ARCH_HAS_DMA_PERNUMA_CMA, only selected by arm64 for now would be > more > clean I think. Sounds good to me. BTW, +Will. Last time we talked about default pernuma cma size, you suggested a bootargs in arch/arm64/Kconfig but Will seems to have different idea. Am I right, Will? Would we let aarch64 call dma_pernuma_cma_reserve(16MB) rather than dma_pernuma_cma_reserve()? In this way, users will at least get a default pernuma CMA which is required at least by IOMMU. If users set a "cma_pernuma" bootargs, it will overwrite the default size from aarch64 code? I mean - void __init dma_pernuma_cma_reserve(size_t size) + void __init dma_pernuma_cma_reserve(size_t size) { if (!pernuma_size_bytes) + pernuma_size_bytes = size; } Right now, it is easy that users will forget to set cma_pernuma in bootargs. Probably this feature is not enabled by users. 
Thanks Barry
[PATCH 0/2] dma-mapping: provide a benchmark for streaming DMA mapping
Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap particularly while the device is attached to an IOMMU. This patchset provides the benchmark infrastructure for streaming DMA mapping. The architecture of the code is pretty similar to the GUP benchmark: * mm/gup_benchmark.c provides kernel interface; * tools/testing/selftests/vm/gup_benchmark.c provides user program to call the interface provided by mm/gup_benchmark.c. In our case, kernel/dma/map_benchmark.c is like mm/gup_benchmark.c; tools/testing/selftests/dma/dma_map_benchmark.c is like tools/testing/ selftests/vm/gup_benchmark.c A major difference with GUP benchmark is DMA_MAP benchmark needs to run on a device. Considering one board with below devices and IOMMUs device A --- IOMMU 1 device B --- IOMMU 2 device C --- non-IOMMU Different devices might attach to different IOMMU or non-IOMMU. To make benchmark run, we can either * create a virtual device and hack the kernel code to attach the virtual device to IOMMU1, IOMMU2 or non-IOMMU. * use the existing driver_override mechanism, unbind device A, B, or C from their original driver and bind them to "dma_map_benchmark" platform_driver or pci_driver for benchmarking. In this patchset, I prefer to use the driver_override and avoid the various hacks in the kernel. We can dynamically switch devices behind different IOMMUs to get the performance of dma map on IOMMU or non-IOMMU. 
Barry Song (2): dma-mapping: add benchmark support for streaming DMA APIs selftests/dma: add test application for DMA_MAP_BENCHMARK MAINTAINERS | 6 + kernel/dma/Kconfig| 8 + kernel/dma/Makefile | 1 + kernel/dma/map_benchmark.c| 202 ++ tools/testing/selftests/dma/Makefile | 6 + tools/testing/selftests/dma/config| 1 + .../testing/selftests/dma/dma_map_benchmark.c | 72 +++ 7 files changed, 296 insertions(+) create mode 100644 kernel/dma/map_benchmark.c create mode 100644 tools/testing/selftests/dma/Makefile create mode 100644 tools/testing/selftests/dma/config create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c -- 2.25.1
[PATCH 2/2] selftests/dma: add test application for DMA_MAP_BENCHMARK
This patch provides the test application for DMA_MAP_BENCHMARK. Before running the test application, we need to bind a device to dma_map_ benchmark driver. For example, unbind "xxx" from its original driver and bind to dma_map_benchmark: echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override echo xxx > /sys/bus/platform/drivers/xxx/unbind echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind Then, run 10 threads on numa node 1 for 10 seconds on device "xxx": ./dma_map_benchmark -t 10 -s 10 -n 1 dma mapping benchmark: average map_nsec:3619 average unmap_nsec:2423 Cc: Joerg Roedel Cc: Will Deacon Cc: Shuah Khan Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Robin Murphy Signed-off-by: Barry Song --- MAINTAINERS | 6 ++ tools/testing/selftests/dma/Makefile | 6 ++ tools/testing/selftests/dma/config| 1 + .../testing/selftests/dma/dma_map_benchmark.c | 72 +++ 4 files changed, 85 insertions(+) create mode 100644 tools/testing/selftests/dma/Makefile create mode 100644 tools/testing/selftests/dma/config create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c diff --git a/MAINTAINERS b/MAINTAINERS index f310f0a09904..552389874ca2 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5220,6 +5220,12 @@ F: include/linux/dma-mapping.h F: include/linux/dma-map-ops.h F: kernel/dma/ +DMA MAPPING BENCHMARK +M: Barry Song +L: iommu@lists.linux-foundation.org +F: kernel/dma/map_benchmark.c +F: tools/testing/selftests/dma/ + DMA-BUF HEAPS FRAMEWORK M: Sumit Semwal R: Andrew F. 
Davis diff --git a/tools/testing/selftests/dma/Makefile b/tools/testing/selftests/dma/Makefile new file mode 100644 index ..aa8e8b5b3864 --- /dev/null +++ b/tools/testing/selftests/dma/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 +CFLAGS += -I../../../../usr/include/ + +TEST_GEN_PROGS := dma_map_benchmark + +include ../lib.mk diff --git a/tools/testing/selftests/dma/config b/tools/testing/selftests/dma/config new file mode 100644 index ..6102ee3c43cd --- /dev/null +++ b/tools/testing/selftests/dma/config @@ -0,0 +1 @@ +CONFIG_DMA_MAP_BENCHMARK=y diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c new file mode 100644 index ..e03bd03e101e --- /dev/null +++ b/tools/testing/selftests/dma/dma_map_benchmark.c @@ -0,0 +1,72 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2020 Hisilicon Limited. + */ + +#include +#include +#include +#include +#include +#include +#include + +#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark) + +struct map_benchmark { + __u64 map_nsec; + __u64 unmap_nsec; + __u32 threads; /* how many threads will do map/unmap in parallel */ + __u32 seconds; /* how long the test will last */ + int node; /* which numa node this benchmark will run on */ + __u64 expansion[10];/* For future use */ +}; + +int main(int argc, char **argv) +{ + struct map_benchmark map; + int fd, opt, threads = 0, seconds = 0, node = -1; + int cmd = DMA_MAP_BENCHMARK; + char *p; + + while ((opt = getopt(argc, argv, "t:s:n:")) != -1) { + switch (opt) { + case 't': + threads = atoi(optarg); + break; + case 's': + seconds = atoi(optarg); + break; + case 'n': + node = atoi(optarg); + break; + default: + return -1; + } + } + + if (threads <= 0 || seconds <= 0) { + perror("invalid number of threads or seconds"); + exit(1); + } + + fd = open("/sys/kernel/debug/dma_map_benchmark", O_RDWR); + if (fd == -1) { + perror("open"); + exit(1); + } + + map.seconds = seconds; + map.threads = 
threads; + map.node = node; + if (ioctl(fd, cmd, &map)) { + perror("ioctl"); + exit(1); + } + + printf("dma mapping benchmark: average map_nsec:%lld average unmap_nsec:%lld\n", + map.map_nsec, + map.unmap_nsec); + + return 0; +} -- 2.25.1
[PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs
Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap, particularly while the device is attached to an IOMMU. This patch enables the support. Users can run a specified number of threads to do dma_map_page and dma_unmap_page on a specific NUMA node for the specified duration. Then dma_map_benchmark will calculate the average latency for map and unmap. A difficulty for this benchmark is that dma_map/unmap APIs must run on a particular device. Each device might have a different backend, IOMMU or non-IOMMU. So we use driver_override to bind dma_map_benchmark to a particular device by:
echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
For the moment, it supports platform devices only; PCI devices will also be supported afterwards. Cc: Joerg Roedel Cc: Will Deacon Cc: Shuah Khan Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Robin Murphy Signed-off-by: Barry Song --- kernel/dma/Kconfig | 8 ++ kernel/dma/Makefile| 1 + kernel/dma/map_benchmark.c | 202 + 3 files changed, 211 insertions(+) create mode 100644 kernel/dma/map_benchmark.c diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index c99de4a21458..949c53da5991 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG is technically out-of-spec. If unsure, say N. + +config DMA_MAP_BENCHMARK + bool "Enable benchmarking of streaming DMA mapping" + help + Provides /sys/kernel/debug/dma_map_benchmark that helps with testing + performance of dma_(un)map_page.
+ + See tools/testing/selftests/dma/dma_map_benchmark.c diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile index dc755ab68aab..7aa6b26b1348 100644 --- a/kernel/dma/Makefile +++ b/kernel/dma/Makefile @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG) += debug.o obj-$(CONFIG_SWIOTLB) += swiotlb.o obj-$(CONFIG_DMA_COHERENT_POOL)+= pool.o obj-$(CONFIG_DMA_REMAP)+= remap.o +obj-$(CONFIG_DMA_MAP_BENCHMARK)+= map_benchmark.o diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c new file mode 100644 index ..16a5d7779d67 --- /dev/null +++ b/kernel/dma/map_benchmark.c @@ -0,0 +1,202 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2020 Hisilicon Limited. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark) + +struct map_benchmark { + __u64 map_nsec; + __u64 unmap_nsec; + __u32 threads; /* how many threads will do map/unmap in parallel */ + __u32 seconds; /* how long the test will last */ + int node; /* which numa node this benchmark will run on */ + __u64 expansion[10];/* For future use */ +}; + +struct map_benchmark_data { + struct map_benchmark bparam; + struct device *dev; + struct dentry *debugfs; + atomic64_t total_map_nsecs; + atomic64_t total_map_loops; + atomic64_t total_unmap_nsecs; + atomic64_t total_unmap_loops; +}; + +static int map_benchmark_thread(void *data) +{ + struct page *page; + dma_addr_t dma_addr; + struct map_benchmark_data *map = data; + int ret = 0; + + page = alloc_page(GFP_KERNEL); + if (!page) + return -ENOMEM; + + while (!kthread_should_stop()) { + ktime_t map_stime, map_etime, unmap_stime, unmap_etime; + + map_stime = ktime_get(); + dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL); + if (unlikely(dma_mapping_error(map->dev, dma_addr))) { + dev_err(map->dev, "dma_map_page failed\n"); + ret = -ENOMEM; + goto out; + } + map_etime = ktime_get(); + + unmap_stime = ktime_get(); + 
dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL); + unmap_etime = ktime_get(); + + atomic64_add((long long)ktime_to_ns(ktime_sub(map_etime, map_stime)), + &map->total_map_nsecs); + atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime, unmap_stime)), + &map->total_unmap_nsecs); + atomic64_inc(&map->total_map_loops); + atomic64_inc(&map->total_unmap_loops); + } + +out: + __free_page(page); + return ret; +} + +static int do_map_benchmark(struct map_benchmark_data *map) +{ + struct task_struct **tsk; + int threads = ma
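The benchmark thread above only accumulates totals; the map_nsec/unmap_nsec values returned by the ioctl come from dividing those atomic64 totals by the loop counts in do_map_benchmark(). A minimal sketch of that arithmetic (the counter values below are hypothetical, not measured results):

```python
# Sketch of how do_map_benchmark() turns the accumulated atomic64
# counters (total_*_nsecs / total_*_loops) into the per-operation
# averages the ioctl reports. Numbers are made up for illustration.

def average_nsec(total_nsecs: int, loops: int) -> int:
    # Integer division, as the kernel's div64 arithmetic would do.
    return total_nsecs // loops if loops else 0

total_map_nsecs, total_map_loops = 12_000_000, 40_000
total_unmap_nsecs, total_unmap_loops = 9_000_000, 40_000

print(average_nsec(total_map_nsecs, total_map_loops))      # 300
print(average_nsec(total_unmap_nsecs, total_unmap_loops))  # 225
```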
RE: [PATCH] dma: Per-NUMA-node CMA should depend on NUMA
> -Original Message- > From: Robin Murphy [mailto:robin.mur...@arm.com] > Sent: Tuesday, October 27, 2020 1:25 AM > To: h...@lst.de > Cc: iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org; Song Bao > Hua (Barry Song) > Subject: [PATCH] dma: Per-NUMA-node CMA should depend on NUMA > > Offering DMA_PERNUMA_CMA to non-NUMA configs is pointless. > This is right. > Signed-off-by: Robin Murphy > --- > kernel/dma/Kconfig | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig > index c99de4a21458..964b74c9b7e3 100644 > --- a/kernel/dma/Kconfig > +++ b/kernel/dma/Kconfig > @@ -125,7 +125,8 @@ if DMA_CMA > > config DMA_PERNUMA_CMA > bool "Enable separate DMA Contiguous Memory Area for each NUMA > Node" > - default NUMA && ARM64 > + depends on NUMA > + default ARM64 On the other hand, at this moment, only ARM64 is calling the init code to get per-numa CMA. Do we need to depend on NUMA && ARM64, so that this is not enabled on non-arm64? > help > Enable this option to get pernuma CMA areas so that devices like > ARM64 SMMU can get local memory by DMA coherent APIs. > -- Thanks Barry
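Barry's suggested alternative would gate the option on both conditions rather than merely defaulting it. As a sketch (not what the posted patch does), the Kconfig entry would become:

```
config DMA_PERNUMA_CMA
	bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
	depends on NUMA && ARM64
```

With `depends on`, non-arm64 NUMA configs could not even select the option; with Robin's `depends on NUMA` plus `default ARM64`, they can still opt in manually.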
RE: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of John Garry > Sent: Saturday, August 22, 2020 1:54 AM > To: w...@kernel.org; robin.mur...@arm.com > Cc: j...@8bytes.org; linux-arm-ker...@lists.infradead.org; > iommu@lists.linux-foundation.org; m...@kernel.org; Linuxarm > ; linux-ker...@vger.kernel.org; John Garry > > Subject: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency > > As mentioned in [0], the CPU may consume many cycles processing > arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to > get space on the queue takes a lot of time once we start getting many CPUs > contending - from experiment, for 64 CPUs contending the cmdq, success rate > is ~ 1 in 12, which is poor, but not totally awful. > > This series removes that cmpxchg() and replaces with an atomic_add, same as > how the actual cmdq deals with maintaining the prod pointer. > > For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput > increase: > Before: 1250K IOPs > After: 1550K IOPs > > I also have a test harness to check the rate of DMA map+unmaps we can > achieve: > >
> CPU count:    8     16    32    64
> Before:     282K   115K   36K   11K
> After:      302K   193K   80K   30K
> > (unit is map+unmaps per CPU per second)

I have seen performance improvement on hns3 network by sending UDP with 1-32 threads:

Thread number:            1         4         8        16        32
Before patch (TX Mbps):   7636.05   16444.36  21694.48  25746.40  25295.93
After patch (TX Mbps):    7711.60   16478.98  26561.06  32628.75  33764.56

As you can see, for 8, 16 and 32 threads, network TX throughput improves a lot. For 1 and 4 threads, TX throughput is almost the same before and after the patch. This should be expected, as this patch is mainly for decreasing the lock contention.
> > [0] > https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD2 > 4b8...@dggemi525-mbs.china.huawei.com/T/#ma02e301c38c3e94b7725e > 685757c27e39c7cbde3 > > Differences to v1: > - Simplify by dropping patch to always issue a CMD_SYNC > - Use 64b atomic add, keeping prod in a separate 32b field > > John Garry (2): > iommu/arm-smmu-v3: Calculate max commands per batch > iommu/arm-smmu-v3: Remove cmpxchg() in > arm_smmu_cmdq_issue_cmdlist() > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 166 > ++-- > 1 file changed, 114 insertions(+), 52 deletions(-) > > -- > 2.26.2 Thanks Barry
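For reference, the relative gains quoted in this thread work out as below — a quick sanity check using only the before/after figures already given (the NVMe IOPs from John's cover letter and the 16-thread hns3 UDP TX numbers):

```python
# Percentage improvement from the before/after figures quoted in the
# thread above; nothing here is new data.
def gain_percent(before: float, after: float) -> float:
    return round((after - before) / before * 100, 1)

print(gain_percent(1250, 1550))          # NVMe IOPs (K): 24.0 %
print(gain_percent(25746.40, 32628.75))  # 16-thread hns3 UDP TX (Mbps)
```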
RE: [PATCH v6 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Christoph Hellwig > Sent: Friday, August 21, 2020 6:19 PM > To: Song Bao Hua (Barry Song) > Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; > w...@kernel.org; ganapatrao.kulka...@cavium.com; > catalin.mari...@arm.com; iommu@lists.linux-foundation.org; Linuxarm > ; linux-arm-ker...@lists.infradead.org; > linux-ker...@vger.kernel.org; huangdaode > Subject: Re: [PATCH v6 0/2] make dma_alloc_coherent NUMA-aware by > per-NUMA CMA > > FYI, as of the last one I'm fine now, but I really need an ACK from > the arm64 maintainers. Hi Christoph, For the changes in arch/arm64, Will gave his ack here: https://lore.kernel.org/linux-iommu/20200821090116.GB20255@willie-the-truck/ and the patchset has been refined to v8 https://lore.kernel.org/linux-iommu/20200823230309.28980-1-song.bao@hisilicon.com/ with one additional patch to remove a magic number: [PATCH v8 3/3] mm: cma: use CMA_MAX_NAME to define the length of cma name array https://lore.kernel.org/linux-iommu/20200823230309.28980-4-song.bao@hisilicon.com/ Hopefully, you didn't miss it :-) Does the new one need an Ack from a linux-mm maintainer? Thanks Barry
RE: [PATCH] iommu/arm-smmu-v3: add tracepoints for cmdq_issue_cmdlist
> -Original Message- > From: Robin Murphy [mailto:robin.mur...@arm.com] > Sent: Friday, August 28, 2020 11:18 PM > To: Song Bao Hua (Barry Song) ; Will Deacon > > Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org; > j...@8bytes.org; Linuxarm > Subject: Re: [PATCH] iommu/arm-smmu-v3: add tracepoints for > cmdq_issue_cmdlist > > On 2020-08-28 12:02, Song Bao Hua (Barry Song) wrote: > > > > > >> -Original Message- > >> From: Will Deacon [mailto:w...@kernel.org] > >> Sent: Friday, August 28, 2020 10:29 PM > >> To: Song Bao Hua (Barry Song) > >> Cc: iommu@lists.linux-foundation.org; > linux-arm-ker...@lists.infradead.org; > >> robin.mur...@arm.com; j...@8bytes.org; Linuxarm > > >> Subject: Re: [PATCH] iommu/arm-smmu-v3: add tracepoints for > >> cmdq_issue_cmdlist > >> > >> On Thu, Aug 27, 2020 at 09:33:51PM +1200, Barry Song wrote: > >>> cmdq_issue_cmdlist() is the hotspot that uses a lot of time. This patch > >>> adds tracepoints for it to help debug. > >>> > >>> Signed-off-by: Barry Song > >>> --- > >>> * can furthermore develop an eBPF program to benchmark using this > trace > >> > >> Hmm, don't these things have a history of becoming ABI? If so, I don't > >> really want them in the driver at all, sorry. Do other drivers overcome > >> this somehow? > > > > This kind of tracepoints mainly works as a low-overhead probe point for > debug purpose. I don't think any > > application would depend on it. It is for debugging. And there are lots of > tracepoints in other drivers > > even in iommu driver core and intel_iommu driver :-) > > > > developers use it in one of the below ways: > > > > 1. 
get trace print from the ring buffer by reading debugfs > > root@ubuntu:/sys/kernel/debug/tracing/events/arm_smmu_v3# echo 1 > > enable > > # cat /sys/kernel/debug/tracing/trace_pipe > > -0 [058] ..s1 125444.768083: issue_cmdlist_exit: > arm-smmu-v3.2.auto cmd number=1 sync=1 > >-0 [058] ..s1 125444.768084: issue_cmdlist_entry: > arm-smmu-v3.2.auto cmd number=1 sync=1 > >-0 [058] ..s1 125444.768085: issue_cmdlist_exit: > arm-smmu-v3.2.auto cmd number=1 sync=1 > >-0 [058] ..s1 125444.768165: issue_cmdlist_entry: > arm-smmu-v3.2.auto cmd number=1 sync=1 > >-0 [058] ..s1 125444.768168: issue_cmdlist_exit: > arm-smmu-v3.2.auto cmd number=1 sync=1 > >-0 [058] ..s1 125444.768169: issue_cmdlist_entry: > arm-smmu-v3.2.auto cmd number=1 sync=1 > >-0 [058] ..s1 125444.768171: issue_cmdlist_exit: > arm-smmu-v3.2.auto cmd number=1 sync=1 > >-0 [058] ..s1 125444.768259: issue_cmdlist_entry: > arm-smmu-v3.2.auto cmd number=1 sync=1 > >... > > > > This can replace printk with much much lower overhead. > > > > 2. add a hook function in tracepoint to do some latency measure and time > statistics just like the eBPF example > > I gave after the commit log. > > > > Using it, I can get the histogram of the execution time of > cmdq_issue_cmdlist(): > > nsecs : count distribution > > 0 -> 1 : 0| > | > > 2 -> 3 : 0| > | > > 4 -> 7 : 0| > | > > 8 -> 15 : 0| > | > > 16 -> 31 : 0| > | > > 32 -> 63 : 0| > | > > 64 -> 127: 0| > | > > 128 -> 255: 0| > | > > 256 -> 511: 0| > | > > 512 -> 1023 : 58 | > | > >1024 -> 2047 : 22763 > || > >2048 -> 4095 : 13238|*** > | > > > > I feel it is very common to do this kind of things for analyzing the > performance issue. For example, to easy the analysis > > of softirq latency, softirq.c has the below code: > > > > asmlinkage __visible void __softirq_entry __do_softirq(void) > > { > > ... > > trace_softirq_entry(vec_nr); > > h->action(h); > > trace_softirq_exit(vec_nr); > > ... 
> > } > > If you only want to measure entry and exit of one specific function, > though, can't the function graph tracer already do that? Function graph is able to do t
RE: [PATCH] iommu/arm-smmu-v3: add tracepoints for cmdq_issue_cmdlist
> -Original Message- > From: Will Deacon [mailto:w...@kernel.org] > Sent: Friday, August 28, 2020 10:29 PM > To: Song Bao Hua (Barry Song) > Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org; > robin.mur...@arm.com; j...@8bytes.org; Linuxarm > Subject: Re: [PATCH] iommu/arm-smmu-v3: add tracepoints for > cmdq_issue_cmdlist > > On Thu, Aug 27, 2020 at 09:33:51PM +1200, Barry Song wrote: > > cmdq_issue_cmdlist() is the hotspot that uses a lot of time. This patch > > adds tracepoints for it to help debug. > > > > Signed-off-by: Barry Song > > --- > > * can furthermore develop an eBPF program to benchmark using this trace > > Hmm, don't these things have a history of becoming ABI? If so, I don't > really want them in the driver at all, sorry. Do other drivers overcome > this somehow? This kind of tracepoints mainly works as a low-overhead probe point for debug purpose. I don't think any application would depend on it. It is for debugging. And there are lots of tracepoints in other drivers even in iommu driver core and intel_iommu driver :-) developers use it in one of the below ways: 1. 
get trace print from the ring buffer by reading debugfs root@ubuntu:/sys/kernel/debug/tracing/events/arm_smmu_v3# echo 1 > enable # cat /sys/kernel/debug/tracing/trace_pipe -0 [058] ..s1 125444.768083: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1 -0 [058] ..s1 125444.768084: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1 -0 [058] ..s1 125444.768085: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1 -0 [058] ..s1 125444.768165: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1 -0 [058] ..s1 125444.768168: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1 -0 [058] ..s1 125444.768169: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1 -0 [058] ..s1 125444.768171: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1 -0 [058] ..s1 125444.768259: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1 ... This can replace printk with much much lower overhead. 2. add a hook function in tracepoint to do some latency measure and time statistics just like the eBPF example I gave after the commit log. Using it, I can get the histogram of the execution time of cmdq_issue_cmdlist(): nsecs : count distribution 0 -> 1 : 0|| 2 -> 3 : 0|| 4 -> 7 : 0|| 8 -> 15 : 0|| 16 -> 31 : 0|| 32 -> 63 : 0|| 64 -> 127: 0|| 128 -> 255: 0|| 256 -> 511: 0|| 512 -> 1023 : 58 || 1024 -> 2047 : 22763|| 2048 -> 4095 : 13238|*** | I feel it is very common to do this kind of things for analyzing the performance issue. For example, to easy the analysis of softirq latency, softirq.c has the below code: asmlinkage __visible void __softirq_entry __do_softirq(void) { ... trace_softirq_entry(vec_nr); h->action(h); trace_softirq_exit(vec_nr); ... } > > Will Thanks Barry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
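The histogram output above comes from dist.increment(bpf_log2l(delta)): each latency sample is filed into a power-of-two bucket, and print_log2_hist() prints the bucket ranges. A small sketch of equivalent bucketing arithmetic (not the eBPF helper itself):

```python
# Power-of-two bucketing equivalent to what bpf_log2l()/print_log2_hist()
# do with each latency delta: a sample of d nanoseconds lands in slot
# floor(log2(d)), whose printed range is [2^n, 2^(n+1)-1].

def log2_bucket(delta_ns):
    return delta_ns.bit_length() - 1 if delta_ns > 0 else 0

def bucket_range(idx):
    return (1 << idx), ((1 << (idx + 1)) - 1)

# A ~1.5us cmdq_issue_cmdlist() run falls into the "1024 -> 2047" row:
idx = log2_bucket(1500)
print(idx, bucket_range(idx))  # 10 (1024, 2047)
```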
RE: [PATCH] iommu/arm-smmu-v3: add tracepoints for cmdq_issue_cmdlist
> -Original Message- > From: Jean-Philippe Brucker [mailto:jean-phili...@linaro.org] > Sent: Friday, August 28, 2020 7:41 PM > To: Song Bao Hua (Barry Song) > Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org; > robin.mur...@arm.com; w...@kernel.org; Linuxarm > Subject: Re: [PATCH] iommu/arm-smmu-v3: add tracepoints for > cmdq_issue_cmdlist > > Hi, > > On Thu, Aug 27, 2020 at 09:33:51PM +1200, Barry Song wrote: > > cmdq_issue_cmdlist() is the hotspot that uses a lot of time. This > > patch adds tracepoints for it to help debug. > > > > Signed-off-by: Barry Song > > --- > > * can furthermore develop an eBPF program to benchmark using this > > trace > > Have you tried using kprobe and kretprobe instead of tracepoints? > Any noticeable performance drop? Yes. Pls read this email. kprobe overhead and OPTPROBES implementation on ARM64 https://www.spinics.net/lists/arm-kernel/msg828788.html > > Thanks, > Jean > > > > > cmdlistlat.c: > > #include > > > > BPF_HASH(start, u32); > > BPF_HISTOGRAM(dist); > > > > TRACEPOINT_PROBE(arm_smmu_v3, issue_cmdlist_entry) { > > u32 pid; > > u64 ts, *val; > > > > pid = bpf_get_current_pid_tgid(); > > ts = bpf_ktime_get_ns(); > > start.update(&pid, &ts); > > return 0; > > } > > > > TRACEPOINT_PROBE(arm_smmu_v3, issue_cmdlist_exit) { > > u32 pid; > > u64 *tsp, delta; > > > > pid = bpf_get_current_pid_tgid(); > > tsp = start.lookup(&pid); > > > > if (tsp != 0) { > > delta = bpf_ktime_get_ns() - *tsp; > > dist.increment(bpf_log2l(delta)); > > start.delete(&pid); > > } > > > > return 0; > > } > > > > cmdlistlat.py: > > #!/usr/bin/python3 > > # > > from __future__ import print_function > > from bcc import BPF > > from ctypes import c_ushort, c_int, c_ulonglong from time import sleep > > from sys import argv > > > > def usage(): > > print("USAGE: %s [interval [count]]" % argv[0]) > > exit() > > > > # arguments > > interval = 5 > > count = -1 > > if len(argv) > 1: > > try: > > interval = int(argv[1]) > > if interval == 
0: > > raise > > if len(argv) > 2: > > count = int(argv[2]) > > except: # also catches -h, --help > > usage() > > > > # load BPF program > > b = BPF(src_file = "cmdlistlat.c") > > > > # header > > print("Tracing... Hit Ctrl-C to end.") > > > > # output > > loop = 0 > > do_exit = 0 > > while (1): > > if count > 0: > > loop += 1 > > if loop > count: > > exit() > > try: > > sleep(interval) > > except KeyboardInterrupt: > > pass; do_exit = 1 > > > > print() > > b["dist"].print_log2_hist("nsecs") > > b["dist"].clear() > > if do_exit: > > exit() > > > > > > drivers/iommu/arm/arm-smmu-v3/Makefile| 1 + > > .../iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h | 48 > +++ > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 8 > > 3 files changed, 57 insertions(+) > > create mode 100644 > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h > > > > diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile > > b/drivers/iommu/arm/arm-smmu-v3/Makefile > > index 569e24e9f162..dba1087f91f3 100644 > > --- a/drivers/iommu/arm/arm-smmu-v3/Makefile > > +++ b/drivers/iommu/arm/arm-smmu-v3/Makefile > > @@ -1,2 +1,3 @@ > > # SPDX-License-Identifier: GPL-2.0 > > +ccflags-y += -I$(src) # needed for trace events > > obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o diff --git > > a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h > > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h > > new file mode 100644 > > index ..29ab96706124 > > --- /dev/null > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h > > @@ -0,0 +1,48 @@ > > +/* SPDX-License-Identifier: GPL-2.0-only */ > > +/* > > + * Copyright (C) 2020 Hisili
[PATCH] iommu/arm-smmu-v3: add tracepoints for cmdq_issue_cmdlist
cmdq_issue_cmdlist() is the hotspot that uses a lot of time. This patch adds tracepoints for it to help debug. Signed-off-by: Barry Song --- * can furthermore develop an eBPF program to benchmark using this trace cmdlistlat.c: #include BPF_HASH(start, u32); BPF_HISTOGRAM(dist); TRACEPOINT_PROBE(arm_smmu_v3, issue_cmdlist_entry) { u32 pid; u64 ts, *val; pid = bpf_get_current_pid_tgid(); ts = bpf_ktime_get_ns(); start.update(&pid, &ts); return 0; } TRACEPOINT_PROBE(arm_smmu_v3, issue_cmdlist_exit) { u32 pid; u64 *tsp, delta; pid = bpf_get_current_pid_tgid(); tsp = start.lookup(&pid); if (tsp != 0) { delta = bpf_ktime_get_ns() - *tsp; dist.increment(bpf_log2l(delta)); start.delete(&pid); } return 0; } cmdlistlat.py: #!/usr/bin/python3 # from __future__ import print_function from bcc import BPF from ctypes import c_ushort, c_int, c_ulonglong from time import sleep from sys import argv def usage(): print("USAGE: %s [interval [count]]" % argv[0]) exit() # arguments interval = 5 count = -1 if len(argv) > 1: try: interval = int(argv[1]) if interval == 0: raise if len(argv) > 2: count = int(argv[2]) except: # also catches -h, --help usage() # load BPF program b = BPF(src_file = "cmdlistlat.c") # header print("Tracing... 
Hit Ctrl-C to end.") # output loop = 0 do_exit = 0 while (1): if count > 0: loop += 1 if loop > count: exit() try: sleep(interval) except KeyboardInterrupt: pass; do_exit = 1 print() b["dist"].print_log2_hist("nsecs") b["dist"].clear() if do_exit: exit() drivers/iommu/arm/arm-smmu-v3/Makefile| 1 + .../iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h | 48 +++ drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 8 3 files changed, 57 insertions(+) create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile index 569e24e9f162..dba1087f91f3 100644 --- a/drivers/iommu/arm/arm-smmu-v3/Makefile +++ b/drivers/iommu/arm/arm-smmu-v3/Makefile @@ -1,2 +1,3 @@ # SPDX-License-Identifier: GPL-2.0 +ccflags-y += -I$(src) # needed for trace events obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h new file mode 100644 index ..29ab96706124 --- /dev/null +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h @@ -0,0 +1,48 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2020 Hisilicon Limited. 
+ */ + +#undef TRACE_SYSTEM +#define TRACE_SYSTEM arm_smmu_v3 + +#if !defined(_ARM_SMMU_V3_TRACE_H) || defined(TRACE_HEADER_MULTI_READ) +#define _ARM_SMMU_V3_TRACE_H + +#include + +struct device; + +DECLARE_EVENT_CLASS(issue_cmdlist_class, + TP_PROTO(struct device *dev, int n, bool sync), + TP_ARGS(dev, n, sync), + + TP_STRUCT__entry( + __string(device, dev_name(dev)) + __field(int, n) + __field(bool, sync) + ), + TP_fast_assign( + __assign_str(device, dev_name(dev)); + __entry->n = n; + __entry->sync = sync; + ), + TP_printk("%s cmd number=%d sync=%d", + __get_str(device), __entry->n, __entry->sync) +); + +#define DEFINE_ISSUE_CMDLIST_EVENT(name) \ +DEFINE_EVENT(issue_cmdlist_class, name,\ + TP_PROTO(struct device *dev, int n, bool sync), \ + TP_ARGS(dev, n, sync)) + +DEFINE_ISSUE_CMDLIST_EVENT(issue_cmdlist_entry); +DEFINE_ISSUE_CMDLIST_EVENT(issue_cmdlist_exit); + +#endif /* _ARM_SMMU_V3_TRACE_H */ + +#undef TRACE_INCLUDE_PATH +#undef TRACE_INCLUDE_FILE +#define TRACE_INCLUDE_PATH . +#define TRACE_INCLUDE_FILE arm-smmu-v3-trace +#include diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 7332251dd8cd..e2d7d5f1d234 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -33,6 +33,8 @@ #include +#include "arm-smmu-v3-trace.h" + /* MMIO registers */ #define ARM_SMMU_IDR0 0x0 #define IDR0_ST_LVLGENMASK(28, 27) @@ -1389,6 +1391,8 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu, }, head = llq; int ret = 0; + trace_issue_cmdlist_entry(smmu->dev, n, sync); + /* 1. Allocate
[PATCH v5 1/3] iommu/arm-smmu-v3: replace symbolic permissions by octal permissions for module parameter
This fixes the below checkpatch issue: WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'. 417: FILE: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:417: module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO); Reviewed-by: Robin Murphy Signed-off-by: Barry Song --- -v5: add Robin's reviewed-by drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 7196207be7ea..eea5f7c6d9ab 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -414,7 +414,7 @@ #define MSI_IOVA_LENGTH 0x10 static bool disable_bypass = 1; -module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO); +module_param_named(disable_bypass, disable_bypass, bool, 0444); MODULE_PARM_DESC(disable_bypass, "Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU."); -- 2.27.0
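The equivalence that makes this a pure cleanup can be checked directly: S_IRUGO is just S_IRUSR | S_IRGRP | S_IROTH, i.e. octal 0444. Python's stat module defines the same POSIX permission bits, so the identity is easy to verify outside the kernel:

```python
import stat

# S_IRUGO in the kernel is S_IRUSR | S_IRGRP | S_IROTH: readable by
# user, group and others, writable by nobody - exactly octal 0444.
S_IRUGO = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
print(oct(S_IRUGO))  # 0o444
```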
[PATCH v5 0/3] iommu/arm-smmu-v3: permit users to disable msi polling
patch 1/3 and patch 2/3 are the preparation of patch 3/3 which permits users to disable MSI-based polling by cmd line. -v5: add Robin's reviewed-by -v4: with respect to Robin's comments * cleanup the code of the existing module parameter disable_bypass * add ARM_SMMU_OPT_MSIPOLL flag. on the other hand, we only need to check a bit in options rather than two bits in features Barry Song (3): iommu/arm-smmu-v3: replace symbolic permissions by octal permissions for module parameter iommu/arm-smmu-v3: replace module_param_named by module_param for disable_bypass iommu/arm-smmu-v3: permit users to disable msi polling drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) -- 2.27.0
[PATCH v5 2/3] iommu/arm-smmu-v3: replace module_param_named by module_param for disable_bypass
Just use module_param() - going out of the way to specify a "different" name that's identical to the variable name is silly. Reviewed-by: Robin Murphy Signed-off-by: Barry Song --- -v5: add Robin's reviewed-by drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index eea5f7c6d9ab..5b40d535a7c8 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -414,7 +414,7 @@ #define MSI_IOVA_LENGTH 0x10 static bool disable_bypass = 1; -module_param_named(disable_bypass, disable_bypass, bool, 0444); +module_param(disable_bypass, bool, 0444); MODULE_PARM_DESC(disable_bypass, "Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU."); -- 2.27.0
[PATCH v5 3/3] iommu/arm-smmu-v3: permit users to disable msi polling
Polling by MSI isn't necessarily faster than polling by SEV. Tests on hi1620 show hns3 100G NIC network throughput can improve from 25G to 27G if we disable MSI polling while running 16 netperf threads sending UDP packets in size 32KB. TX throughput can improve from 7G to 7.7G for single thread. The reason for the throughput improvement is that the latency to poll the completion of CMD_SYNC becomes smaller. After sending a CMD_SYNC in an empty cmd queue, typically we need to wait for 280ns using MSI polling. But we only need around 190ns after disabling MSI polling. This patch provides a command line option so that users can decide to use MSI polling or not based on their tests. Reviewed-by: Robin Murphy Signed-off-by: Barry Song --- -v5: add Robin's reviewed-by drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 - 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 5b40d535a7c8..7332251dd8cd 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -418,6 +418,11 @@ module_param(disable_bypass, bool, 0444); MODULE_PARM_DESC(disable_bypass, "Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU."); +static bool disable_msipolling; +module_param(disable_msipolling, bool, 0444); +MODULE_PARM_DESC(disable_msipolling, + "Disable MSI-based polling for CMD_SYNC completion."); + enum pri_resp { PRI_RESP_DENY = 0, PRI_RESP_FAIL = 1, @@ -652,6 +657,7 @@ struct arm_smmu_device { #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0) #define ARM_SMMU_OPT_PAGE0_REGS_ONLY (1 << 1) +#define ARM_SMMU_OPT_MSIPOLL (1 << 2) u32 options; struct arm_smmu_cmdqcmdq; @@ -992,8 +998,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu, * Beware that Hi16xx adds an extra 32 
bits of goodness to its MSI * payload, so the write will zero the entire command on that platform. */ - if (smmu->features & ARM_SMMU_FEAT_MSI && - smmu->features & ARM_SMMU_FEAT_COHERENCY) { + if (smmu->options & ARM_SMMU_OPT_MSIPOLL) { ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) * q->ent_dwords * 8; } @@ -1332,8 +1337,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu, static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu, struct arm_smmu_ll_queue *llq) { - if (smmu->features & ARM_SMMU_FEAT_MSI && - smmu->features & ARM_SMMU_FEAT_COHERENCY) + if (smmu->options & ARM_SMMU_OPT_MSIPOLL) return __arm_smmu_cmdq_poll_until_msi(smmu, llq); return __arm_smmu_cmdq_poll_until_consumed(smmu, llq); @@ -3741,8 +3745,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) if (reg & IDR0_SEV) smmu->features |= ARM_SMMU_FEAT_SEV; - if (reg & IDR0_MSI) + if (reg & IDR0_MSI) { smmu->features |= ARM_SMMU_FEAT_MSI; + if (coherent && !disable_msipolling) + smmu->options |= ARM_SMMU_OPT_MSIPOLL; + } if (reg & IDR0_HYP) smmu->features |= ARM_SMMU_FEAT_HYP; -- 2.27.0
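With this patch applied, the fallback to SEV/consumer-based polling can be requested at boot. Assuming the driver is built in and takes the usual module-name prefix for its parameters (worth double-checking on a given build), the kernel command-line form would be:

```
# Fall back to SEV/consumer polling for CMD_SYNC completion:
arm_smmu_v3.disable_msipolling=1
```

Since the parameter is registered with mode 0444, the value in effect should also be readable at runtime from /sys/module/arm_smmu_v3/parameters/disable_msipolling.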
[PATCH v8 1/3] dma-contiguous: provide the ability to reserve per-numa CMA
Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get coherent DMA buffers to save their command queues and page tables. As there is only one default CMA in the whole system, SMMUs on nodes other than node0 will get remote memory. This leads to significant latency. This patch provides per-numa CMA so that drivers like SMMU can get local memory. Tests show localizing CMA can decrease dma_unmap latency much. For instance, before this patch, SMMU on node2 has to wait for more than 560ns for the completion of CMD_SYNC in an empty command queue; with this patch, it needs 240ns only. A positive side effect of this patch would be improving performance even further for those users who are worried about performance more than DMA security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all drivers can get local coherent DMA buffers. Also, this patch changes the default CONFIG_CMA_AREAS to 19 in NUMA. As 1+CONFIG_CMA_AREAS should be quite enough for most servers on the market even they enable both hugetlb_cma and pernuma_cma. 2 numa nodes: 2(hugetlb) + 2(pernuma) + 1(default global cma) = 5 4 numa nodes: 4(hugetlb) + 4(pernuma) + 1(default global cma) = 9 8 numa nodes: 8(hugetlb) + 8(pernuma) + 1(default global cma) = 17 Cc: Randy Dunlap Cc: Mike Kravetz Cc: Jonathan Cameron Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Will Deacon Cc: Robin Murphy Cc: Ganapatrao Kulkarni Cc: Catalin Marinas Cc: Nicolas Saenz Julienne Cc: Steve Capper Cc: Andrew Morton Cc: Mike Rapoport Signed-off-by: Barry Song --- -v8: * rename parameter from pernuma_cma to cma_pernuma with respect to the comments of Mike Rapoport and Randy Dunlap * if both hugetlb_cma and pernuma_cma are enabled, we may need a larger default CMA_AREAS. 
In numa, we set it to 19 based on the discussion with Mike Kravetz .../admin-guide/kernel-parameters.txt | 11 ++ include/linux/dma-contiguous.h| 6 ++ kernel/dma/Kconfig| 11 ++ kernel/dma/contiguous.c | 100 -- mm/Kconfig| 3 +- 5 files changed, 120 insertions(+), 11 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index bdc1f33fd3d1..8291e2e7a99c 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -599,6 +599,17 @@ altogether. For more information, see include/linux/dma-contiguous.h + cma_pernuma=nn[MG] + [ARM64,KNL] + Sets the size of kernel per-numa memory area for + contiguous memory allocations. A value of 0 disables + per-numa CMA altogether. And If this option is not + specificed, the default value is 0. + With per-numa CMA enabled, DMA users on node nid will + first try to allocate buffer from the pernuma area + which is located in node nid, if the allocation fails, + they will fallback to the global default memory area. + cmo_free_hint= [PPC] Format: { yes | no } Specify whether pages are marked as being inactive when they are freed. This is used in CMO environments diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h index 03f8e98e3bcc..fe55e004f1f4 100644 --- a/include/linux/dma-contiguous.h +++ b/include/linux/dma-contiguous.h @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, struct page *page, #endif +#ifdef CONFIG_DMA_PERNUMA_CMA +void dma_pernuma_cma_reserve(void); +#else +static inline void dma_pernuma_cma_reserve(void) { } +#endif + #endif #endif diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index 847a9d1fa634..0ddfb5510fe4 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -118,6 +118,17 @@ config DMA_CMA If unsure, say "n". 
if DMA_CMA + +config DMA_PERNUMA_CMA + bool "Enable separate DMA Contiguous Memory Area for each NUMA Node" + default NUMA && ARM64 + help + Enable this option to get pernuma CMA areas so that devices like + ARM64 SMMU can get local memory by DMA coherent APIs. + + You can set the size of pernuma CMA by specifying "cma_pernuma=size" + on the kernel's command line. + comment "Default contiguous memory area size:" config CMA_SIZE_MBYTES diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c index cff7e60968b9..aa53384fd7dc 100644 --- a/kernel/dma/contiguous.c +++ b/kernel/dma/contiguous.c @@ -69,6 +69,19 @@ static int __init early_cma(char *p) } early_param("cma", early_cma); +#ifdef CONFIG_DMA_PERNUMA_CMA + +static
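The kernel-parameters text in the patch above describes the allocation policy: a DMA user on node nid first tries the per-numa area located on that node and, if that allocation fails, falls back to the global default area. A toy model of that fallback policy (all names and the page-accounting scheme here are illustrative; the real code lives in kernel/dma/contiguous.c):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the fallback described in the cma_pernuma documentation:
 * try the requesting node's per-numa area first, then the global
 * default area.  Free space is modelled as a simple page counter. */
#define MAX_NODES 4

static int pernuma_free_pages[MAX_NODES]; /* pages left per node */
static int global_free_pages;             /* pages left in default area */

/* Returns 1 on success; *from_node records where the pages came from
 * (-1 means the global default area). */
static int toy_alloc(int nid, int pages, int *from_node)
{
	if (nid >= 0 && nid < MAX_NODES && pernuma_free_pages[nid] >= pages) {
		pernuma_free_pages[nid] -= pages;
		*from_node = nid;
		return 1;
	}
	if (global_free_pages >= pages) {	/* fallback path */
		global_free_pages -= pages;
		*from_node = -1;
		return 1;
	}
	return 0;
}
```

The point the model captures is that node-local memory is preferred but never required: exhausting one node's area degrades to the pre-patch behaviour rather than failing the allocation.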
[PATCH v8 2/3] arm64: mm: reserve per-numa CMA to localize coherent dma buffers
Right now, smmu is using dma_alloc_coherent() to get memory to save queues and tables. Typically, on an ARM64 server, there is a default CMA located at node0, which could be far away from node2, node3 etc. With this patch, smmu will get memory from the local numa node to save command queues and page tables. That means dma_unmap latency will shrink significantly. Meanwhile, when iommu.passthrough is on, device drivers which call dma_alloc_coherent() will also get local memory and avoid the travel between numa nodes.

Acked-by: Will Deacon Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Robin Murphy Cc: Ganapatrao Kulkarni Cc: Catalin Marinas Cc: Nicolas Saenz Julienne Cc: Steve Capper Cc: Andrew Morton Cc: Mike Rapoport Signed-off-by: Barry Song ---
arch/arm64/mm/init.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 481d22c32a2e..f1c75957ff3c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -429,6 +429,8 @@ void __init bootmem_init(void)
 	arm64_hugetlb_cma_reserve();
 #endif
 
+	dma_pernuma_cma_reserve();
+
 	/*
 	 * sparse_init() tries to allocate memory from memblock, so must be
 	 * done after the fixed reservations
-- 2.27.0
[PATCH v8 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA
Ganapatrao Kulkarni has put some effort into making arm-smmu-v3 use local memory to save command queues[1]. I also did a similar job in patch "iommu/arm-smmu-v3: allocate the memory of queues in local numa node" [2] without realizing Ganapatrao had done that before. But it seems it is much better to make dma_alloc_coherent() inherently NUMA-aware on NUMA-capable systems.

Right now, smmu is using dma_alloc_coherent() to get memory to save queues and tables. Typically, on an ARM64 server, there is a default CMA located at node0, which could be far away from node2, node3 etc. Saving queues and tables remotely will increase the latency of ARM SMMU significantly. For example, when SMMU is at node2 and the default global CMA is at node0, after sending a CMD_SYNC in an empty command queue, we have to wait more than 550ns for the completion of the command CMD_SYNC. However, if we save them locally, we only need to wait for 240ns.

With per-numa CMA, smmu will get memory from the local numa node to save command queues and page tables. That means dma_unmap latency will shrink significantly. Meanwhile, when iommu.passthrough is on, device drivers which call dma_alloc_coherent() will also get local memory and avoid the travel between numa nodes.

[1] https://lists.linuxfoundation.org/pipermail/iommu/2017-October/024455.html
[2] https://www.spinics.net/lists/iommu/msg44767.html

-v8:
* rename parameter from pernuma_cma to cma_pernuma with respect to the comments of Mike Rapoport and Randy Dunlap
* if both hugetlb_cma and pernuma_cma are enabled, we may need a larger default CMA_AREAS. In numa, we set it to 19 based on the discussion with Mike Kravetz

-v7:
* add Will's acked-by
* some cleanup with respect to Will's comments
* add patch 3/3 to remove the hardcode of defining the size of cma name.
this patch requires some header file change in include/linux -v6: * rebase on top of 5.9-rc1 * doc cleanup -v5: refine code according to Christoph Hellwig's comments * remove Kconfig option for pernuma cma size; * add Kconfig option for pernuma cma enable; * code cleanup like line over 80 char I haven't removed the cma NULL check code in cma_alloc() as it requires a bundle of other changes. So I prefer to handle this issue separately. -v4: * rebase on top of Christoph Hellwig's patch: [PATCH v2] dma-contiguous: cleanup dma_alloc_contiguous https://lore.kernel.org/linux-iommu/20200723120133.94105-1-...@lst.de/ * cleanup according to Christoph's comment * rebase on top of linux-next to avoid arch/arm64 conflicts * reserve cma by checking N_MEMORY rather than N_ONLINE -v3: * move to use page_to_nid() while freeing cma with respect to Robin's comment, but this will only work after applying my below patch: "mm/cma.c: use exact_nid true to fix possible per-numa cma leak" https://marc.info/?l=linux-mm&m=159333034726647&w=2 * handle the case count <= 1 more properly according to Robin's comment; * add pernuma_cma parameter to support dynamic setting of per-numa cma size; ideally we can leverage the CMA_SIZE_MBYTES, CMA_SIZE_PERCENTAGE and "cma=" kernel parameter and avoid a new paramter separately for per- numa cma. Practically, it is really too complicated considering the below problems: (1) if we leverage the size of default numa for per-numa, we have to avoid creating two cma with same size in node0 since default cma is probably on node0. (2) default cma can consider the address limitation for old devices while per-numa cma doesn't support GFP_DMA and GFP_DMA32. all allocations with limitation flags will fallback to default one. (3) hard to apply CMA_SIZE_PERCENTAGE to per-numa. it is hard to decide if the percentage should apply to the whole memory size or only apply to the memory size of a specific numa node. 
(4) default cma size has CMA_SIZE_SEL_MIN and CMA_SIZE_SEL_MAX, it makes things even more complicated to per-numa cma. I haven't figured out a good way to leverage the size of default cma for per-numa cma. it seems a separate parameter for per-numa could make life easier. * move dma_pernuma_cma_reserve() after hugetlb_cma_reserve() to reuse the comment before hugetlb_cma_reserve() with respect to Robin's comment -v2: * fix some issues reported by kernel test robot * fallback to default cma while allocation fails in per-numa cma free memory properly Barry Song (3): dma-contiguous: provide the ability to reserve per-numa CMA arm64: mm: reserve per-numa CMA to localize coherent dma buffers mm: cma: use CMA_MAX_NAME to define the length of cma name array .../admin-guide/kernel-parameters.txt | 11 ++ arch/arm64/mm/init.c | 2 + include/linux/cma.h | 2 + include/linux/dma-contiguous.h| 6 ++ kernel/dma/Kconfig
[PATCH v8 3/3] mm: cma: use CMA_MAX_NAME to define the length of cma name array
CMA_MAX_NAME should be visible to CMA's users as they might need it to set the name of CMA areas and avoid hardcoding the size locally. So this patch moves CMA_MAX_NAME from local header file to include/linux header file and removes the hardcode in both hugetlb.c and contiguous.c. Cc: Mike Kravetz Cc: Roman Gushchin Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Will Deacon Cc: Robin Murphy Cc: Andrew Morton Signed-off-by: Barry Song --- this patch is fixing the magic number issue with respect to Will's comment here: https://lore.kernel.org/linux-iommu/4ab78767553f48a584217063f6f24...@hisilicon.com/ include/linux/cma.h | 2 ++ kernel/dma/contiguous.c | 2 +- mm/cma.h| 2 -- mm/hugetlb.c| 4 ++-- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/cma.h b/include/linux/cma.h index 6ff79fefd01f..217999c8a762 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -18,6 +18,8 @@ #endif +#define CMA_MAX_NAME 64 + struct cma; extern unsigned long totalcma_pages; diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c index aa53384fd7dc..f4c150810fd2 100644 --- a/kernel/dma/contiguous.c +++ b/kernel/dma/contiguous.c @@ -119,7 +119,7 @@ void __init dma_pernuma_cma_reserve(void) for_each_online_node(nid) { int ret; - char name[20]; + char name[CMA_MAX_NAME]; struct cma **cma = &dma_contiguous_pernuma_area[nid]; snprintf(name, sizeof(name), "pernuma%d", nid); diff --git a/mm/cma.h b/mm/cma.h index 20f6e24bc477..42ae082cb067 100644 --- a/mm/cma.h +++ b/mm/cma.h @@ -4,8 +4,6 @@ #include -#define CMA_MAX_NAME 64 - struct cma { unsigned long base_pfn; unsigned long count; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a301c2d672bf..9eec0ea9ba68 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5683,12 +5683,12 @@ void __init hugetlb_cma_reserve(int order) reserved = 0; for_each_node_state(nid, N_ONLINE) { int res; - char name[20]; + char name[CMA_MAX_NAME]; size = min(per_node, hugetlb_cma_size - reserved); size = round_up(size, PAGE_SIZE << 
order); - snprintf(name, 20, "hugetlb%d", nid); + snprintf(name, sizeof(name), "hugetlb%d", nid); res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order, 0, false, name, &hugetlb_cma[nid], nid); -- 2.27.0
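The patch's change is mechanical but easy to get wrong elsewhere: size the name buffer with the shared CMA_MAX_NAME constant and let snprintf() take its bound from sizeof(name) instead of repeating a magic 20. A standalone sketch of the pattern (the helper function is illustrative, not kernel code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the naming pattern the patch standardizes: one shared
 * constant sizes every CMA name buffer, and snprintf() bounds are
 * always derived from the buffer via sizeof(), never hardcoded. */
#define CMA_MAX_NAME 64

/* Hypothetical helper: format "<prefix><nid>" into a caller buffer. */
static void cma_area_name(char *name, size_t len, const char *prefix, int nid)
{
	snprintf(name, len, "%s%d", prefix, nid);
}
```

With the constant in include/linux/cma.h, both hugetlb.c and contiguous.c stay in sync automatically if the limit ever changes.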
RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA
> -Original Message- > From: Song Bao Hua (Barry Song) > Sent: Saturday, August 22, 2020 7:27 AM > To: 'Mike Kravetz' ; h...@lst.de; > m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org; > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com; > a...@linux-foundation.org > Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org; > linux-ker...@vger.kernel.org; Zengtao (B) ; > huangdaode ; Linuxarm > Subject: RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by > per-NUMA CMA > > > > > -Original Message- > > From: Mike Kravetz [mailto:mike.krav...@oracle.com] > > Sent: Saturday, August 22, 2020 5:53 AM > > To: Song Bao Hua (Barry Song) ; h...@lst.de; > > m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org; > > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com; > > a...@linux-foundation.org > > Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org; > > linux-ker...@vger.kernel.org; Zengtao (B) ; > > huangdaode ; Linuxarm > > > Subject: Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by > > per-NUMA CMA > > > > Hi Barry, > > Sorry for jumping in so late. > > > > On 8/21/20 4:33 AM, Barry Song wrote: > > > > > > with per-numa CMA, smmu will get memory from local numa node to save > > command > > > queues and page tables. that means dma_unmap latency will be shrunk > > much. > > > > Since per-node CMA areas for hugetlb was introduced, I have been thinking > > about the limited number of CMA areas. In most configurations, I believe > > it is limited to 7. And, IIRC it is not something that can be changed at > > runtime, you need to reconfig and rebuild to increase the number. In > contrast > > some configs have NODES_SHIFT set to 10. I wasn't too worried because of > > the limited hugetlb use case. However, this series is adding another user > > of per-node CMA areas. > > > > With more users, should try to sync up number of CMA areas and number of > > nodes? 
Or, perhaps I am worrying about nothing? > > Hi Mike, > The current limitation is 8. If the server has 4 nodes and we enable both > pernuma > CMA and hugetlb, the last node will fail to get one cma area as the default > global cma area will take 1 of 8. So users need to change menuconfig. > If the server has 8 nodes, we enable one of pernuma cma and hugetlb, one > node > will fail to get cma. > > We may set the default number of CMA areas as 8+MAX_NODES(if hugetlb > enabled) + > MAX_NODES(if pernuma cma enabled) if we don't expect users to change > config, but > right now hugetlb has not an option in Kconfig to enable or disable like > pernuma cma > has DMA_PERNUMA_CMA. I would prefer we make some changes like: config CMA_AREAS int "Maximum count of the CMA areas" depends on CMA + default 19 if NUMA default 7 help CMA allows to create CMA areas for particular purpose, mainly, used as device private area. This parameter sets the maximum number of CMA area in the system. - If unsure, leave the default value "7". + If unsure, leave the default value "7" or "19" if NUMA is used. 1+ CONFIG_CMA_AREAS should be quite enough for almost all servers in the markets. If 2 numa nodes, and both hugetlb cma and pernuma cma is enabled, we need 2*2 + 1 = 5 If 4 numa nodes, and both hugetlb cma and pernuma cma is enabled, we need 2*4 + 1 = 9-> default ARM64 config. If 8 numa nodes, and both hugetlb cma and pernuma cma is enabled, we need 2*8 + 1 = 17 The default value is supporting the most common case and is not going to support those servers with NODES_SHIFT=10, they can make their own config just like users need to increase CMA_AREAS if they add many cma areas in device tree in a system even without NUMA. How do you think, mike? Thanks Barry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
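The sizing rule spelled out above (one hugetlb area and one pernuma area per node, plus the single global default) can be sanity-checked against the quoted numbers:

```c
#include <assert.h>

/* The CMA area accounting from the thread above: with both hugetlb_cma
 * and pernuma cma enabled, each NUMA node consumes one area per
 * feature, plus one global default area.  Function name is ours, just
 * restating the 2*n + 1 arithmetic. */
static int cma_areas_needed(int numa_nodes)
{
	return numa_nodes /* hugetlb */ + numa_nodes /* pernuma */
		+ 1 /* global default */;
}
```

This is why `default 19 if NUMA` works out: 1 + CONFIG_CMA_AREAS = 20 areas cover even the 8-node case (17 areas) with room to spare.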
RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA
> -Original Message- > From: Mike Kravetz [mailto:mike.krav...@oracle.com] > Sent: Saturday, August 22, 2020 5:53 AM > To: Song Bao Hua (Barry Song) ; h...@lst.de; > m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org; > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com; > a...@linux-foundation.org > Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org; > linux-ker...@vger.kernel.org; Zengtao (B) ; > huangdaode ; Linuxarm > Subject: Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by > per-NUMA CMA > > Hi Barry, > Sorry for jumping in so late. > > On 8/21/20 4:33 AM, Barry Song wrote: > > > > with per-numa CMA, smmu will get memory from local numa node to save > command > > queues and page tables. that means dma_unmap latency will be shrunk > much. > > Since per-node CMA areas for hugetlb was introduced, I have been thinking > about the limited number of CMA areas. In most configurations, I believe > it is limited to 7. And, IIRC it is not something that can be changed at > runtime, you need to reconfig and rebuild to increase the number. In contrast > some configs have NODES_SHIFT set to 10. I wasn't too worried because of > the limited hugetlb use case. However, this series is adding another user > of per-node CMA areas. > > With more users, should try to sync up number of CMA areas and number of > nodes? Or, perhaps I am worrying about nothing? Hi Mike, The current limitation is 8. If the server has 4 nodes and we enable both pernuma CMA and hugetlb, the last node will fail to get one cma area as the default global cma area will take 1 of 8. So users need to change menuconfig. If the server has 8 nodes, we enable one of pernuma cma and hugetlb, one node will fail to get cma. 
We may set the default number of CMA areas as 8+MAX_NODES(if hugetlb enabled) + MAX_NODES(if pernuma cma enabled) if we don't expect users to change config, but right now hugetlb does not have an option in Kconfig to enable or disable it the way pernuma cma has DMA_PERNUMA_CMA. > -- > Mike Kravetz Thanks Barry
RE: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA
> -Original Message- > From: Randy Dunlap [mailto:rdun...@infradead.org] > Sent: Saturday, August 22, 2020 4:08 AM > To: Song Bao Hua (Barry Song) ; h...@lst.de; > m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org; > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com; > a...@linux-foundation.org > Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org; > linux-ker...@vger.kernel.org; Zengtao (B) ; > huangdaode ; Linuxarm ; > Jonathan Cameron ; Nicolas Saenz Julienne > ; Steve Capper ; Mike > Rapoport > Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve > per-numa CMA > > On 8/21/20 4:33 AM, Barry Song wrote: > > --- > > -v7: with respect to Will's comments > > * move to use for_each_online_node > > * add description if users don't specify pernuma_cma > > * provide default value for CONFIG_DMA_PERNUMA_CMA > > > > .../admin-guide/kernel-parameters.txt | 11 ++ > > include/linux/dma-contiguous.h| 6 ++ > > kernel/dma/Kconfig| 11 ++ > > kernel/dma/contiguous.c | 100 > -- > > 4 files changed, 118 insertions(+), 10 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > > index bdc1f33fd3d1..c609527fc35a 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -599,6 +599,17 @@ > > altogether. For more information, see > > include/linux/dma-contiguous.h > > > > + pernuma_cma=nn[MG] > > + [ARM64,KNL] > > + Sets the size of kernel per-numa memory area for > > + contiguous memory allocations. A value of 0 disables > > + per-numa CMA altogether. And If this option is not > > + specificed, the default value is 0. > > + With per-numa CMA enabled, DMA users on node nid will > > + first try to allocate buffer from the pernuma area > > + which is located in node nid, if the allocation fails, > > + they will fallback to the global default memory area. 
> > + > > Entries in kernel-parameters.txt are supposed to be in alphabetical order > but this one is not. If you want to keep it near the cma= entry, you can > rename it like Mike suggested. Otherwise it needs to be moved. As I've replied in Mike's comment, I'd like to rename it to cma_per... > > > > cmo_free_hint= [PPC] Format: { yes | no } > > Specify whether pages are marked as being inactive > > when they are freed. This is used in CMO environments > > > > -- > ~Randy Thanks Barry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA
> -Original Message- > From: Mike Rapoport [mailto:r...@linux.ibm.com] > Sent: Saturday, August 22, 2020 2:28 AM > To: Song Bao Hua (Barry Song) > Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; > w...@kernel.org; ganapatrao.kulka...@cavium.com; > catalin.mari...@arm.com; a...@linux-foundation.org; > iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org; > linux-ker...@vger.kernel.org; Zengtao (B) ; > huangdaode ; Linuxarm ; > Jonathan Cameron ; Nicolas Saenz Julienne > ; Steve Capper > Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve > per-numa CMA > > On Fri, Aug 21, 2020 at 11:33:53PM +1200, Barry Song wrote: > > Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get > > coherent DMA buffers to save their command queues and page tables. As > > there is only one default CMA in the whole system, SMMUs on nodes other > > than node0 will get remote memory. This leads to significant latency. > > > > This patch provides per-numa CMA so that drivers like SMMU can get local > > memory. Tests show localizing CMA can decrease dma_unmap latency much. > > For instance, before this patch, SMMU on node2 has to wait for more than > > 560ns for the completion of CMD_SYNC in an empty command queue; with > this > > patch, it needs 240ns only. > > > > A positive side effect of this patch would be improving performance even > > further for those users who are worried about performance more than DMA > > security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all > > drivers can get local coherent DMA buffers. 
> > > > Cc: Jonathan Cameron > > Cc: Christoph Hellwig > > Cc: Marek Szyprowski > > Cc: Will Deacon > > Cc: Robin Murphy > > Cc: Ganapatrao Kulkarni > > Cc: Catalin Marinas > > Cc: Nicolas Saenz Julienne > > Cc: Steve Capper > > Cc: Andrew Morton > > Cc: Mike Rapoport > > Signed-off-by: Barry Song > > --- > > -v7: with respect to Will's comments > > * move to use for_each_online_node > > * add description if users don't specify pernuma_cma > > * provide default value for CONFIG_DMA_PERNUMA_CMA > > > > .../admin-guide/kernel-parameters.txt | 11 ++ > > include/linux/dma-contiguous.h| 6 ++ > > kernel/dma/Kconfig| 11 ++ > > kernel/dma/contiguous.c | 100 > -- > > 4 files changed, 118 insertions(+), 10 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > > index bdc1f33fd3d1..c609527fc35a 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -599,6 +599,17 @@ > > altogether. For more information, see > > include/linux/dma-contiguous.h > > > > + pernuma_cma=nn[MG] > > Maybe cma_pernuma or cma_pernode? Sounds good. > > > + [ARM64,KNL] > > + Sets the size of kernel per-numa memory area for > > + contiguous memory allocations. A value of 0 disables > > + per-numa CMA altogether. And If this option is not > > + specificed, the default value is 0. > > + With per-numa CMA enabled, DMA users on node nid will > > + first try to allocate buffer from the pernuma area > > + which is located in node nid, if the allocation fails, > > + they will fallback to the global default memory area. > > + > > cmo_free_hint= [PPC] Format: { yes | no } > > Specify whether pages are marked as being inactive > > when they are freed. 
This is used in CMO environments > > diff --git a/include/linux/dma-contiguous.h > b/include/linux/dma-contiguous.h > > index 03f8e98e3bcc..fe55e004f1f4 100644 > > --- a/include/linux/dma-contiguous.h > > +++ b/include/linux/dma-contiguous.h > > @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct > device *dev, struct page *page, > > > > #endif > > > > +#ifdef CONFIG_DMA_PERNUMA_CMA > > +void dma_pernuma_cma_reserve(void); > > +#else > > +static inline void dma_pernuma_cma_reserve(void) { } > > +#endif > > + > > #endif > > > > #endif > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig > > index 847a9d1fa634..c38979d45b13 100
[PATCH v7 2/3] arm64: mm: reserve per-numa CMA to localize coherent dma buffers
Right now, smmu is using dma_alloc_coherent() to get memory to save queues and tables. Typically, on ARM64 server, there is a default CMA located at node0, which could be far away from node2, node3 etc. with this patch, smmu will get memory from local numa node to save command queues and page tables. that means dma_unmap latency will be shrunk much. Meanwhile, when iommu.passthrough is on, device drivers which call dma_ alloc_coherent() will also get local memory and avoid the travel between numa nodes. Acked-by: Will Deacon Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Robin Murphy Cc: Ganapatrao Kulkarni Cc: Catalin Marinas Cc: Nicolas Saenz Julienne Cc: Steve Capper Cc: Andrew Morton Cc: Mike Rapoport Signed-off-by: Barry Song --- -v7: add Will's acked-by arch/arm64/mm/init.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 481d22c32a2e..f1c75957ff3c 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -429,6 +429,8 @@ void __init bootmem_init(void) arm64_hugetlb_cma_reserve(); #endif + dma_pernuma_cma_reserve(); + /* * sparse_init() tries to allocate memory from memblock, so must be * done after the fixed reservations -- 2.27.0 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v7 3/3] mm: cma: use CMA_MAX_NAME to define the length of cma name array
CMA_MAX_NAME should be visible to CMA's users as they might need it to set the name of CMA areas and avoid hardcoding the size locally. So this patch moves CMA_MAX_NAME from local header file to include/linux header file and removes the magic number in hugetlb.c and contiguous.c. Cc: Mike Kravetz Cc: Roman Gushchin Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Will Deacon Cc: Robin Murphy Cc: Andrew Morton Signed-off-by: Barry Song --- this patch is fixing the magic number issue with respect to Will's comment here: https://lore.kernel.org/linux-iommu/4ab78767553f48a584217063f6f24...@hisilicon.com/ include/linux/cma.h | 2 ++ kernel/dma/contiguous.c | 2 +- mm/cma.h| 2 -- mm/hugetlb.c| 4 ++-- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/cma.h b/include/linux/cma.h index 6ff79fefd01f..217999c8a762 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -18,6 +18,8 @@ #endif +#define CMA_MAX_NAME 64 + struct cma; extern unsigned long totalcma_pages; diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c index 0383c9b86715..d2d6b715c274 100644 --- a/kernel/dma/contiguous.c +++ b/kernel/dma/contiguous.c @@ -119,7 +119,7 @@ void __init dma_pernuma_cma_reserve(void) for_each_online_node(nid) { int ret; - char name[20]; + char name[CMA_MAX_NAME]; struct cma **cma = &dma_contiguous_pernuma_area[nid]; snprintf(name, sizeof(name), "pernuma%d", nid); diff --git a/mm/cma.h b/mm/cma.h index 20f6e24bc477..42ae082cb067 100644 --- a/mm/cma.h +++ b/mm/cma.h @@ -4,8 +4,6 @@ #include -#define CMA_MAX_NAME 64 - struct cma { unsigned long base_pfn; unsigned long count; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a301c2d672bf..9eec0ea9ba68 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5683,12 +5683,12 @@ void __init hugetlb_cma_reserve(int order) reserved = 0; for_each_node_state(nid, N_ONLINE) { int res; - char name[20]; + char name[CMA_MAX_NAME]; size = min(per_node, hugetlb_cma_size - reserved); size = round_up(size, PAGE_SIZE << 
order); - snprintf(name, 20, "hugetlb%d", nid); + snprintf(name, sizeof(name), "hugetlb%d", nid); res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order, 0, false, name, &hugetlb_cma[nid], nid); -- 2.27.0
[PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA
Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get coherent DMA buffers to save their command queues and page tables. As there is only one default CMA in the whole system, SMMUs on nodes other than node0 will get remote memory. This leads to significant latency. This patch provides per-numa CMA so that drivers like SMMU can get local memory. Tests show localizing CMA can decrease dma_unmap latency much. For instance, before this patch, SMMU on node2 has to wait for more than 560ns for the completion of CMD_SYNC in an empty command queue; with this patch, it needs 240ns only. A positive side effect of this patch would be improving performance even further for those users who are worried about performance more than DMA security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all drivers can get local coherent DMA buffers. Cc: Jonathan Cameron Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Will Deacon Cc: Robin Murphy Cc: Ganapatrao Kulkarni Cc: Catalin Marinas Cc: Nicolas Saenz Julienne Cc: Steve Capper Cc: Andrew Morton Cc: Mike Rapoport Signed-off-by: Barry Song --- -v7: with respect to Will's comments * move to use for_each_online_node * add description if users don't specify pernuma_cma * provide default value for CONFIG_DMA_PERNUMA_CMA .../admin-guide/kernel-parameters.txt | 11 ++ include/linux/dma-contiguous.h| 6 ++ kernel/dma/Kconfig| 11 ++ kernel/dma/contiguous.c | 100 -- 4 files changed, 118 insertions(+), 10 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index bdc1f33fd3d1..c609527fc35a 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -599,6 +599,17 @@ altogether. For more information, see include/linux/dma-contiguous.h + pernuma_cma=nn[MG] + [ARM64,KNL] + Sets the size of kernel per-numa memory area for + contiguous memory allocations. 
A value of 0 disables + per-numa CMA altogether. And If this option is not + specificed, the default value is 0. + With per-numa CMA enabled, DMA users on node nid will + first try to allocate buffer from the pernuma area + which is located in node nid, if the allocation fails, + they will fallback to the global default memory area. + cmo_free_hint= [PPC] Format: { yes | no } Specify whether pages are marked as being inactive when they are freed. This is used in CMO environments diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h index 03f8e98e3bcc..fe55e004f1f4 100644 --- a/include/linux/dma-contiguous.h +++ b/include/linux/dma-contiguous.h @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, struct page *page, #endif +#ifdef CONFIG_DMA_PERNUMA_CMA +void dma_pernuma_cma_reserve(void); +#else +static inline void dma_pernuma_cma_reserve(void) { } +#endif + #endif #endif diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index 847a9d1fa634..c38979d45b13 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -118,6 +118,17 @@ config DMA_CMA If unsure, say "n". if DMA_CMA + +config DMA_PERNUMA_CMA + bool "Enable separate DMA Contiguous Memory Area for each NUMA Node" + default NUMA && ARM64 + help + Enable this option to get pernuma CMA areas so that devices like + ARM64 SMMU can get local memory by DMA coherent APIs. + + You can set the size of pernuma CMA by specifying "pernuma_cma=size" + on the kernel's command line. 
+ comment "Default contiguous memory area size:" config CMA_SIZE_MBYTES diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c index cff7e60968b9..0383c9b86715 100644 --- a/kernel/dma/contiguous.c +++ b/kernel/dma/contiguous.c @@ -69,6 +69,19 @@ static int __init early_cma(char *p) } early_param("cma", early_cma); +#ifdef CONFIG_DMA_PERNUMA_CMA + +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES]; +static phys_addr_t pernuma_size_bytes __initdata; + +static int __init early_pernuma_cma(char *p) +{ + pernuma_size_bytes = memparse(p, &p); + return 0; +} +early_param("pernuma_cma", early_pernuma_cma); +#endif + #ifdef CONFIG_CMA_SIZE_PERCENTAGE static phys_addr_t __init __maybe_unused cma_early_percent_memory(void) @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void) #endif +#ifdef CONFIG_DMA_PERNUMA_CMA +void __init dma_pernuma_cma_rese
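The `early_param("pernuma_cma", early_pernuma_cma)` handler in the diff above hands the command-line value to memparse(), which accepts a number with an optional binary-unit suffix — the `nn[MG]` format documented in kernel-parameters.txt. A rough userspace model of that parsing, just to illustrate the format (the real helper, memparse() in lib/cmdline.c, supports more suffixes and returns the end pointer):

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified model of memparse()-style size parsing for values like
 * "pernuma_cma=64M": a number with an optional K/M/G binary suffix.
 * Deliberate fall-through multiplies by 1024 once per level. */
static unsigned long long parse_size(const char *s)
{
	char *end;
	unsigned long long val = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		break;
	default:
		break;
	}
	return val;
}
```

A plain number with no suffix is taken as bytes, matching the documented behaviour that `pernuma_cma=0` (or an absent option) disables the per-numa areas.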
[PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA
Ganapatrao Kulkarni has put some effort into making arm-smmu-v3 use local memory to save command queues[1]. I also did a similar job in patch "iommu/arm-smmu-v3: allocate the memory of queues in local numa node" [2] while not realizing Ganapatrao has done that before. But it seems it is much better to make dma_alloc_coherent() inherently NUMA-aware on NUMA-capable systems. Right now, smmu is using dma_alloc_coherent() to get memory to save queues and tables. Typically, on an ARM64 server, there is a default CMA located at node0, which could be far away from node2, node3 etc. Saving queues and tables remotely will increase the latency of ARM SMMU significantly. For example, when the SMMU is at node2 and the default global CMA is at node0, after sending a CMD_SYNC in an empty command queue, we have to wait more than 550ns for the completion of the command CMD_SYNC. However, if we save them locally, we only need to wait for 240ns. With per-numa CMA, smmu will get memory from the local numa node to save command queues and page tables. That means dma_unmap latency will be reduced significantly. Meanwhile, when iommu.passthrough is on, device drivers which call dma_alloc_coherent() will also get local memory and avoid the travel between numa nodes. [1] https://lists.linuxfoundation.org/pipermail/iommu/2017-October/024455.html [2] https://www.spinics.net/lists/iommu/msg44767.html -v7: * add Will's acked-by for the change in arch/arm64 * some cleanup with respect to Will's comments * add patch 3/3 to remove the hardcode of defining the size of cma name. This patch requires some header file change in include/linux -v6: * rebase on top of 5.9-rc1 * doc cleanup -v5: refine code according to Christoph Hellwig's comments * remove Kconfig option for pernuma cma size; * add Kconfig option for pernuma cma enable; * code cleanup like line over 80 char I haven't removed the cma NULL check code in cma_alloc() as it requires a bundle of other changes. So I prefer to handle this issue separately. 
-v4: * rebase on top of Christoph Hellwig's patch: [PATCH v2] dma-contiguous: cleanup dma_alloc_contiguous https://lore.kernel.org/linux-iommu/20200723120133.94105-1-...@lst.de/ * cleanup according to Christoph's comment * rebase on top of linux-next to avoid arch/arm64 conflicts * reserve cma by checking N_MEMORY rather than N_ONLINE -v3: * move to use page_to_nid() while freeing cma with respect to Robin's comment, but this will only work after applying my below patch: "mm/cma.c: use exact_nid true to fix possible per-numa cma leak" https://marc.info/?l=linux-mm&m=159333034726647&w=2 * handle the case count <= 1 more properly according to Robin's comment; * add pernuma_cma parameter to support dynamic setting of per-numa cma size; ideally we can leverage the CMA_SIZE_MBYTES, CMA_SIZE_PERCENTAGE and "cma=" kernel parameter and avoid a new parameter separately for per-numa cma. Practically, it is really too complicated considering the below problems: (1) if we leverage the size of default cma for per-numa, we have to avoid creating two cma with same size in node0 since default cma is probably on node0. (2) default cma can consider the address limitation for old devices while per-numa cma doesn't support GFP_DMA and GFP_DMA32. All allocations with limitation flags will fallback to default one. (3) hard to apply CMA_SIZE_PERCENTAGE to per-numa. It is hard to decide if the percentage should apply to the whole memory size or only apply to the memory size of a specific numa node. (4) default cma size has CMA_SIZE_SEL_MIN and CMA_SIZE_SEL_MAX, it makes things even more complicated for per-numa cma. I haven't figured out a good way to leverage the size of default cma for per-numa cma. It seems a separate parameter for per-numa could make life easier. 
* move dma_pernuma_cma_reserve() after hugetlb_cma_reserve() to reuse the comment before hugetlb_cma_reserve() with respect to Robin's comment -v2: * fix some issues reported by kernel test robot * fallback to default cma while allocation fails in per-numa cma free memory properly Barry Song (3): dma-contiguous: provide the ability to reserve per-numa CMA arm64: mm: reserve per-numa CMA to localize coherent dma buffers mm: cma: use CMA_MAX_NAME to define the length of cma name array .../admin-guide/kernel-parameters.txt | 11 ++ arch/arm64/mm/init.c | 2 + include/linux/cma.h | 2 + include/linux/dma-contiguous.h| 6 ++ kernel/dma/Kconfig| 11 ++ kernel/dma/contiguous.c | 100 -- mm/cma.h | 2 - mm/hugetlb.c | 4 +- 8 files changed, 124 insertions(+), 14 deletions(-) -- 2.27.0
RE: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA
> -Original Message- > From: Will Deacon [mailto:w...@kernel.org] > Sent: Friday, August 21, 2020 9:27 PM > To: Song Bao Hua (Barry Song) > Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com; > iommu@lists.linux-foundation.org; Linuxarm ; > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; > huangdaode ; Jonathan Cameron > ; Nicolas Saenz Julienne > ; Steve Capper ; Andrew > Morton ; Mike Rapoport > Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve > per-numa CMA > > On Fri, Aug 21, 2020 at 09:13:39AM +, Song Bao Hua (Barry Song) wrote: > > > > > > > -Original Message- > > > From: Will Deacon [mailto:w...@kernel.org] > > > Sent: Friday, August 21, 2020 8:47 PM > > > To: Song Bao Hua (Barry Song) > > > Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; > > > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com; > > > iommu@lists.linux-foundation.org; Linuxarm ; > > > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; > > > huangdaode ; Jonathan Cameron > > > ; Nicolas Saenz Julienne > > > ; Steve Capper ; > > > Andrew Morton ; Mike Rapoport > > > > > > Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to > > > reserve per-numa CMA > > > > > > On Fri, Aug 21, 2020 at 02:26:14PM +1200, Barry Song wrote: > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > > > b/Documentation/admin-guide/kernel-parameters.txt > > > > index bdc1f33fd3d1..3f33b89aeab5 100644 > > > > --- a/Documentation/admin-guide/kernel-parameters.txt > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > > > @@ -599,6 +599,15 @@ > > > > altogether. For more information, see > > > > include/linux/dma-contiguous.h > > > > > > > > + pernuma_cma=nn[MG] > > > > + [ARM64,KNL] > > > > + Sets the size of kernel per-numa memory area for > > > > + contiguous memory allocations. 
A value of 0 > > > > disables > > > > + per-numa CMA altogether. DMA users on node nid > > > > will > > > > + first try to allocate buffer from the pernuma > > > > area > > > > + which is located in node nid, if the allocation > > > > fails, > > > > + they will fallback to the global default memory > > > > area. > > > > > > What is the default behaviour if this option is not specified? Seems > > > like that should be mentioned here. > > Just wanted to make sure you didn't miss this ^^ If it is not specified, the default size is 0 that means pernuma_cma is disabled. Will put some words for this. > > > > > > > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index > > > > 847a9d1fa634..db7a37ed35eb 100644 > > > > --- a/kernel/dma/Kconfig > > > > +++ b/kernel/dma/Kconfig > > > > @@ -118,6 +118,16 @@ config DMA_CMA > > > > If unsure, say "n". > > > > > > > > if DMA_CMA > > > > + > > > > +config DMA_PERNUMA_CMA > > > > + bool "Enable separate DMA Contiguous Memory Area for each > NUMA > > > Node" > > > > > > I don't understand the need for this config option. If you have > > > DMA_DMA and you have NUMA, why wouldn't you want this enabled? > > > > Christoph preferred this in previous patchset in order to be able to > > remove all of the code in the text if users don't use pernuma CMA. > > Ok, I defer to Christoph here, but maybe a "default NUMA" might work? maybe "default NUMA && ARM64"? Though I believe it will benefit x86, but I don't have a x86 server hardware and real scenario to test. So I haven't put the dma_pernuma_cma_reserve() code in arch/x86. Hopefully some x86 guys will bring it up and remove the "&& ARM64". > > > > > + help > > > > + Enable this option to get pernuma CMA areas so that devices > > > > like > > > > + ARM64 SMMU can get local memory by DMA coherent APIs. > > > > + > > > > + You can set the size o
RE: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA
> -Original Message- > From: Will Deacon [mailto:w...@kernel.org] > Sent: Friday, August 21, 2020 8:47 PM > To: Song Bao Hua (Barry Song) > Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com; > iommu@lists.linux-foundation.org; Linuxarm ; > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; > huangdaode ; Jonathan Cameron > ; Nicolas Saenz Julienne > ; Steve Capper ; Andrew > Morton ; Mike Rapoport > Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve > per-numa CMA > > On Fri, Aug 21, 2020 at 02:26:14PM +1200, Barry Song wrote: > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > > index bdc1f33fd3d1..3f33b89aeab5 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -599,6 +599,15 @@ > > altogether. For more information, see > > include/linux/dma-contiguous.h > > > > + pernuma_cma=nn[MG] > > + [ARM64,KNL] > > + Sets the size of kernel per-numa memory area for > > + contiguous memory allocations. A value of 0 disables > > + per-numa CMA altogether. DMA users on node nid will > > + first try to allocate buffer from the pernuma area > > + which is located in node nid, if the allocation fails, > > + they will fallback to the global default memory area. > > What is the default behaviour if this option is not specified? Seems like > that should be mentioned here. > > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig > > index 847a9d1fa634..db7a37ed35eb 100644 > > --- a/kernel/dma/Kconfig > > +++ b/kernel/dma/Kconfig > > @@ -118,6 +118,16 @@ config DMA_CMA > > If unsure, say "n". > > > > if DMA_CMA > > + > > +config DMA_PERNUMA_CMA > > + bool "Enable separate DMA Contiguous Memory Area for each NUMA > Node" > > I don't understand the need for this config option. 
If you have DMA_DMA and > you have NUMA, why wouldn't you want this enabled? Christoph preferred this in previous patchset in order to be able to remove all of the code in the text if users don't use pernuma CMA. > > > + help > > + Enable this option to get pernuma CMA areas so that devices like > > + ARM64 SMMU can get local memory by DMA coherent APIs. > > + > > + You can set the size of pernuma CMA by specifying > "pernuma_cma=size" > > + on the kernel's command line. > > + > > comment "Default contiguous memory area size:" > > > > config CMA_SIZE_MBYTES > > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c > > index cff7e60968b9..89b95f10e56d 100644 > > --- a/kernel/dma/contiguous.c > > +++ b/kernel/dma/contiguous.c > > @@ -69,6 +69,19 @@ static int __init early_cma(char *p) > > } > > early_param("cma", early_cma); > > > > +#ifdef CONFIG_DMA_PERNUMA_CMA > > + > > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES]; > > +static phys_addr_t pernuma_size_bytes __initdata; > > + > > +static int __init early_pernuma_cma(char *p) > > +{ > > + pernuma_size_bytes = memparse(p, &p); > > + return 0; > > +} > > +early_param("pernuma_cma", early_pernuma_cma); > > +#endif > > + > > #ifdef CONFIG_CMA_SIZE_PERCENTAGE > > > > static phys_addr_t __init __maybe_unused > cma_early_percent_memory(void) > > @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t > cma_early_percent_memory(void) > > > > #endif > > > > +#ifdef CONFIG_DMA_PERNUMA_CMA > > +void __init dma_pernuma_cma_reserve(void) > > +{ > > + int nid; > > + > > + if (!pernuma_size_bytes) > > + return; > > If this is useful (I assume it is), then I think we should have a non-zero > default value, a bit like normal CMA does via CMA_SIZE_MBYTES. 
The patchset used to have a CONFIG_PERNUMA_CMA_SIZE in kernel/dma/Kconfig, but Christoph was not comfortable with it: https://lore.kernel.org/linux-iommu/20200728115231.ga...@lst.de/ Would you mind hardcoding the value in CONFIG_CMDLINE in arch/arm64/Kconfig as Christoph mentioned: config CMDLINE default "pernuma_cma=16M" If you also don't like the change in arch/arm64/Kconfig CMDLINE, I guess I have to depend on u
RE: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Randy Dunlap > Sent: Friday, August 21, 2020 2:50 PM > To: Song Bao Hua (Barry Song) ; h...@lst.de; > m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org; > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com > Cc: iommu@lists.linux-foundation.org; Linuxarm ; > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; > huangdaode ; Jonathan Cameron > ; Nicolas Saenz Julienne > ; Steve Capper ; Andrew > Morton ; Mike Rapoport > Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve > per-numa CMA > > On 8/20/20 7:26 PM, Barry Song wrote: > > > > > > Cc: Jonathan Cameron > > Cc: Christoph Hellwig > > Cc: Marek Szyprowski > > Cc: Will Deacon > > Cc: Robin Murphy > > Cc: Ganapatrao Kulkarni > > Cc: Catalin Marinas > > Cc: Nicolas Saenz Julienne > > Cc: Steve Capper > > Cc: Andrew Morton > > Cc: Mike Rapoport > > Signed-off-by: Barry Song > > --- > > v6: rebase on top of 5.9-rc1; > > doc cleanup > > > > .../admin-guide/kernel-parameters.txt | 9 ++ > > include/linux/dma-contiguous.h| 6 ++ > > kernel/dma/Kconfig| 10 ++ > > kernel/dma/contiguous.c | 100 > -- > > 4 files changed, 115 insertions(+), 10 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > > index bdc1f33fd3d1..3f33b89aeab5 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -599,6 +599,15 @@ > > altogether. For more information, see > > include/linux/dma-contiguous.h > > > > + pernuma_cma=nn[MG] > > memparse() allows any one of these suffixes: K, M, G, T, P, E > and nothing in the option parsing function cares what suffix is used... Hello Randy, Thanks for your comments. 
Actually I am following the suffix of default cma: cma=nn[MG]@[start[MG][-end[MG]]] [ARM,X86,KNL] Sets the size of kernel global memory area for contiguous memory allocations and optionally the placement constraint by the physical address range of memory allocations. A value of 0 disables CMA altogether. For more information, see include/linux/dma-contiguous.h I suggest users should set the size in either MB or GB as they set cma. > > > + [ARM64,KNL] > > + Sets the size of kernel per-numa memory area for > > + contiguous memory allocations. A value of 0 disables > > + per-numa CMA altogether. DMA users on node nid will > > + first try to allocate buffer from the pernuma area > > + which is located in node nid, if the allocation fails, > > + they will fallback to the global default memory area. > > + > > cmo_free_hint= [PPC] Format: { yes | no } > > Specify whether pages are marked as being inactive > > when they are freed. This is used in CMO environments > > > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c > > index cff7e60968b9..89b95f10e56d 100644 > > --- a/kernel/dma/contiguous.c > > +++ b/kernel/dma/contiguous.c > > @@ -69,6 +69,19 @@ static int __init early_cma(char *p) > > } > > early_param("cma", early_cma); > > > > +#ifdef CONFIG_DMA_PERNUMA_CMA > > + > > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES]; > > +static phys_addr_t pernuma_size_bytes __initdata; > > why phys_addr_t? couldn't it just be unsigned long long? > Mainly because of following the programming habit in kernel/dma/contiguous.c: I think the original code probably meant the size should not be larger than the MAXIMUM value of phys_addr_t: /* * Default global CMA area size can be defined in kernel's .config. * This is useful mainly for distro maintainers to create a kernel * that works correctly for most supported systems. * The size can be set in bytes or as a percentage of the total memory * in the system. 
* * Users, who want to set the size of global CMA area for their system * should use cma= kernel parameter. */ static const phys_addr_t size_bytes __initconst = (phys_addr_t)CMA
[PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA
Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get coherent DMA buffers to save their command queues and page tables. As there is only one default CMA in the whole system, SMMUs on nodes other than node0 will get remote memory. This leads to significant latency. This patch provides per-numa CMA so that drivers like SMMU can get local memory. Tests show localizing CMA can decrease dma_unmap latency much. For instance, before this patch, SMMU on node2 has to wait for more than 560ns for the completion of CMD_SYNC in an empty command queue; with this patch, it needs 240ns only. A positive side effect of this patch would be improving performance even further for those users who are worried about performance more than DMA security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all drivers can get local coherent DMA buffers. Cc: Jonathan Cameron Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Will Deacon Cc: Robin Murphy Cc: Ganapatrao Kulkarni Cc: Catalin Marinas Cc: Nicolas Saenz Julienne Cc: Steve Capper Cc: Andrew Morton Cc: Mike Rapoport Signed-off-by: Barry Song --- v6: rebase on top of 5.9-rc1; doc cleanup .../admin-guide/kernel-parameters.txt | 9 ++ include/linux/dma-contiguous.h| 6 ++ kernel/dma/Kconfig| 10 ++ kernel/dma/contiguous.c | 100 -- 4 files changed, 115 insertions(+), 10 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index bdc1f33fd3d1..3f33b89aeab5 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -599,6 +599,15 @@ altogether. For more information, see include/linux/dma-contiguous.h + pernuma_cma=nn[MG] + [ARM64,KNL] + Sets the size of kernel per-numa memory area for + contiguous memory allocations. A value of 0 disables + per-numa CMA altogether. 
DMA users on node nid will + first try to allocate buffer from the pernuma area + which is located in node nid, if the allocation fails, + they will fallback to the global default memory area. + cmo_free_hint= [PPC] Format: { yes | no } Specify whether pages are marked as being inactive when they are freed. This is used in CMO environments diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h index 03f8e98e3bcc..fe55e004f1f4 100644 --- a/include/linux/dma-contiguous.h +++ b/include/linux/dma-contiguous.h @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, struct page *page, #endif +#ifdef CONFIG_DMA_PERNUMA_CMA +void dma_pernuma_cma_reserve(void); +#else +static inline void dma_pernuma_cma_reserve(void) { } +#endif + #endif #endif diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index 847a9d1fa634..db7a37ed35eb 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -118,6 +118,16 @@ config DMA_CMA If unsure, say "n". if DMA_CMA + +config DMA_PERNUMA_CMA + bool "Enable separate DMA Contiguous Memory Area for each NUMA Node" + help + Enable this option to get pernuma CMA areas so that devices like + ARM64 SMMU can get local memory by DMA coherent APIs. + + You can set the size of pernuma CMA by specifying "pernuma_cma=size" + on the kernel's command line. 
+ comment "Default contiguous memory area size:" config CMA_SIZE_MBYTES diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c index cff7e60968b9..89b95f10e56d 100644 --- a/kernel/dma/contiguous.c +++ b/kernel/dma/contiguous.c @@ -69,6 +69,19 @@ static int __init early_cma(char *p) } early_param("cma", early_cma); +#ifdef CONFIG_DMA_PERNUMA_CMA + +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES]; +static phys_addr_t pernuma_size_bytes __initdata; + +static int __init early_pernuma_cma(char *p) +{ + pernuma_size_bytes = memparse(p, &p); + return 0; +} +early_param("pernuma_cma", early_pernuma_cma); +#endif + #ifdef CONFIG_CMA_SIZE_PERCENTAGE static phys_addr_t __init __maybe_unused cma_early_percent_memory(void) @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t cma_early_percent_memory(void) #endif +#ifdef CONFIG_DMA_PERNUMA_CMA +void __init dma_pernuma_cma_reserve(void) +{ + int nid; + + if (!pernuma_size_bytes) + return; + + for_each_node_state(nid, N_ONLINE) { + int ret; + char name[20]; + struct cma **cma = &dma_contiguous_pernuma_area[nid]; + + snprintf(name, sizeof(name), "pernuma%d",
[PATCH v6 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA
Ganapatrao Kulkarni has put some effort into making arm-smmu-v3 use local memory to save command queues[1]. I also did a similar job in patch "iommu/arm-smmu-v3: allocate the memory of queues in local numa node" [2] while not realizing Ganapatrao has done that before. But it seems it is much better to make dma_alloc_coherent() inherently NUMA-aware on NUMA-capable systems. Right now, smmu is using dma_alloc_coherent() to get memory to save queues and tables. Typically, on an ARM64 server, there is a default CMA located at node0, which could be far away from node2, node3 etc. Saving queues and tables remotely will increase the latency of ARM SMMU significantly. For example, when the SMMU is at node2 and the default global CMA is at node0, after sending a CMD_SYNC in an empty command queue, we have to wait more than 550ns for the completion of the command CMD_SYNC. However, if we save them locally, we only need to wait for 240ns. With per-numa CMA, smmu will get memory from the local numa node to save command queues and page tables. That means dma_unmap latency will be reduced significantly. Meanwhile, when iommu.passthrough is on, device drivers which call dma_alloc_coherent() will also get local memory and avoid the travel between numa nodes. I only have ARM64 server platforms to test, but I believe this patch will benefit X86 somehow. Hopefully, some X86 guys will bring it up on x86. [1] https://lists.linuxfoundation.org/pipermail/iommu/2017-October/024455.html [2] https://www.spinics.net/lists/iommu/msg44767.html -v6: * rebase on top of 5.9-rc1 * doc cleanup -v5: refine code according to Christoph Hellwig's comments * remove Kconfig option for pernuma cma size; * add Kconfig option for pernuma cma enable; * code cleanup like line over 80 char I haven't removed the cma NULL check code in cma_alloc() as it requires a bundle of other changes. So I prefer to handle this issue separately. 
-v4: * rebase on top of Christoph Hellwig's patch: [PATCH v2] dma-contiguous: cleanup dma_alloc_contiguous https://lore.kernel.org/linux-iommu/20200723120133.94105-1-...@lst.de/ * cleanup according to Christoph's comment * rebase on top of linux-next to avoid arch/arm64 conflicts * reserve cma by checking N_MEMORY rather than N_ONLINE -v3: * move to use page_to_nid() while freeing cma with respect to Robin's comment, but this will only work after applying my below patch: "mm/cma.c: use exact_nid true to fix possible per-numa cma leak" https://marc.info/?l=linux-mm&m=159333034726647&w=2 * handle the case count <= 1 more properly according to Robin's comment; * add pernuma_cma parameter to support dynamic setting of per-numa cma size; ideally we can leverage the CMA_SIZE_MBYTES, CMA_SIZE_PERCENTAGE and "cma=" kernel parameter and avoid a new parameter separately for per-numa cma. Practically, it is really too complicated considering the below problems: (1) if we leverage the size of default cma for per-numa, we have to avoid creating two cma with same size in node0 since default cma is probably on node0. (2) default cma can consider the address limitation for old devices while per-numa cma doesn't support GFP_DMA and GFP_DMA32. All allocations with limitation flags will fallback to default one. (3) hard to apply CMA_SIZE_PERCENTAGE to per-numa. It is hard to decide if the percentage should apply to the whole memory size or only apply to the memory size of a specific numa node. (4) default cma size has CMA_SIZE_SEL_MIN and CMA_SIZE_SEL_MAX, it makes things even more complicated for per-numa cma. I haven't figured out a good way to leverage the size of default cma for per-numa cma. It seems a separate parameter for per-numa could make life easier. 
* move dma_pernuma_cma_reserve() after hugetlb_cma_reserve() to reuse the comment before hugetlb_cma_reserve() with respect to Robin's comment -v2: * fix some issues reported by kernel test robot * fallback to default cma while allocation fails in per-numa cma free memory properly Barry Song (2): dma-contiguous: provide the ability to reserve per-numa CMA arm64: mm: reserve per-numa CMA to localize coherent dma buffers .../admin-guide/kernel-parameters.txt | 9 ++ arch/arm64/mm/init.c | 2 + include/linux/dma-contiguous.h| 6 ++ kernel/dma/Kconfig| 10 ++ kernel/dma/contiguous.c | 100 -- 5 files changed, 117 insertions(+), 10 deletions(-) -- 2.27.0 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 2/2] arm64: mm: reserve per-numa CMA to localize coherent dma buffers
Right now, smmu is using dma_alloc_coherent() to get memory to save queues and tables. Typically, on an ARM64 server, there is a default CMA located at node0, which could be far away from node2, node3 etc. With this patch, smmu will get memory from the local numa node to save command queues and page tables. That means dma_unmap latency will be reduced significantly. Meanwhile, when iommu.passthrough is on, device drivers which call dma_alloc_coherent() will also get local memory and avoid the travel between numa nodes. Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Will Deacon Cc: Robin Murphy Cc: Ganapatrao Kulkarni Cc: Catalin Marinas Cc: Nicolas Saenz Julienne Cc: Steve Capper Cc: Andrew Morton Cc: Mike Rapoport Signed-off-by: Barry Song --- -v6: rebase on top of 5.9-rc1 arch/arm64/mm/init.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 481d22c32a2e..f1c75957ff3c 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -429,6 +429,8 @@ void __init bootmem_init(void) arm64_hugetlb_cma_reserve(); #endif + dma_pernuma_cma_reserve(); + /* * sparse_init() tries to allocate memory from memblock, so must be * done after the fixed reservations -- 2.27.0
RE: [PATCH v4] iommu/arm-smmu-v3: permit users to disable msi polling
> -Original Message- > From: Robin Murphy [mailto:robin.mur...@arm.com] > Sent: Wednesday, August 19, 2020 2:31 AM > To: Song Bao Hua (Barry Song) ; w...@kernel.org; > j...@8bytes.org > Cc: Zengtao (B) ; > iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org; > Linuxarm > Subject: Re: [PATCH v4] iommu/arm-smmu-v3: permit users to disable msi > polling > > On 2020-08-18 12:17, Barry Song wrote: > > Polling by MSI isn't necessarily faster than polling by SEV. Tests on > > hi1620 show hns3 100G NIC network throughput can improve from 25G to > > 27G if we disable MSI polling while running 16 netperf threads sending > > UDP packets in size 32KB. TX throughput can improve from 7G to 7.7G for > > single thread. > > The reason for the throughput improvement is that the latency to poll > > the completion of CMD_SYNC becomes smaller. After sending a CMD_SYNC > > in an empty cmd queue, typically we need to wait for 280ns using MSI > > polling. But we only need around 190ns after disabling MSI polling. > > This patch provides a command line option so that users can decide to > > use MSI polling or not based on their tests. 
> > > > Signed-off-by: Barry Song > > --- > > -v4: rebase on top of 5.9-rc1 > > refine changelog > > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 > ++ > > 1 file changed, 14 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > index 7196207be7ea..89d3cb391fef 100644 > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > > @@ -418,6 +418,11 @@ module_param_named(disable_bypass, > disable_bypass, bool, S_IRUGO); > > MODULE_PARM_DESC(disable_bypass, > > "Disable bypass streams such that incoming transactions from devices > that are not attached to an iommu domain will report an abort back to the > device and will not be allowed to pass through the SMMU."); > > > > +static bool disable_msipolling; > > +module_param_named(disable_msipolling, disable_msipolling, bool, > S_IRUGO); > > Just use module_param() - going out of the way to specify a "different" > name that's identical to the variable name is silly. Thanks for pointing out, also fixed the same issue in the existing parameter disable_bypass in the new patchset. But I am sorry I made a typo, the new patchset should be v5. But I wrote v4. > > Also I think the preference these days is to specify permissions as > plain octal constants rather than those rather inscrutable macros. I > certainly find that more readable myself. > > (Yes, the existing parameter commits the same offences, but I'd rather > clean that up separately than perpetuate it) Thanks for pointing out. Got fixed in the new patchset. 
> > > +MODULE_PARM_DESC(disable_msipolling, > > + "Disable MSI-based polling for CMD_SYNC completion."); > > + > > enum pri_resp { > > PRI_RESP_DENY = 0, > > PRI_RESP_FAIL = 1, > > @@ -980,6 +985,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, > struct arm_smmu_cmdq_ent *ent) > > return 0; > > } > > > > +static bool arm_smmu_use_msipolling(struct arm_smmu_device *smmu) > > +{ > > + return !disable_msipolling && > > + smmu->features & ARM_SMMU_FEAT_COHERENCY && > > + smmu->features & ARM_SMMU_FEAT_MSI; > > +} > > I'd wrap this up into a new ARM_SMMU_OPT_MSIPOLL flag set at probe time, > rather than constantly reevaluating this whole expression (now that it's > no longer simply testing two adjacent bits of the same word). Got it done in the new patchset. It turns out we only need to check one bit now with the new patch: - if (smmu->features & ARM_SMMU_FEAT_MSI && - smmu->features & ARM_SMMU_FEAT_COHERENCY) + if (smmu->options & ARM_SMMU_OPT_MSIPOLL) return __arm_smmu_cmdq_poll_until_msi(smmu, llq); > > Robin. > > > + > > static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct > arm_smmu_device *smmu, > > u32 prod) > > { > > @@ -992,8 +1004,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 > *cmd, struct arm_smmu_device *smmu, > > * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI > > * payload, so the write will zero the entire command on that platform. > > */ > > - if (smmu->features & ARM_SMMU_FEAT_MSI && &
[PATCH v4 1/3] iommu/arm-smmu-v3: replace symbolic permissions by octal permissions for module parameter
This fixes the below checkpatch issue: WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'. 417: FILE: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:417: module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO); -v4: * cleanup the existing module parameter of bypass_ * add ARM_SMMU_OPT_MSIPOLL flag with respect to Robin's comments Signed-off-by: Barry Song --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 7196207be7ea..eea5f7c6d9ab 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -414,7 +414,7 @@ #define MSI_IOVA_LENGTH0x10 static bool disable_bypass = 1; -module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO); +module_param_named(disable_bypass, disable_bypass, bool, 0444); MODULE_PARM_DESC(disable_bypass, "Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU."); -- 2.27.0
[PATCH v4 3/3] iommu/arm-smmu-v3: permit users to disable msi polling
Polling by MSI isn't necessarily faster than polling by SEV. Tests on
hi1620 show hns3 100G NIC network throughput can improve from 25G to
27G if we disable MSI polling while running 16 netperf threads sending
UDP packets in size 32KB. TX throughput can improve from 7G to 7.7G for
single thread.

The reason for the throughput improvement is that the latency to poll
the completion of CMD_SYNC becomes smaller. After sending a CMD_SYNC in
an empty cmd queue, typically we need to wait for 280ns using MSI
polling. But we only need around 190ns after disabling MSI polling.

This patch provides a command line option so that users can decide to
use MSI polling or not based on their tests.

Signed-off-by: Barry Song
---
-v4: add ARM_SMMU_OPT_MSIPOLL flag with respect to Robin's comment

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5b40d535a7c8..7332251dd8cd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -418,6 +418,11 @@ module_param(disable_bypass, bool, 0444);
 MODULE_PARM_DESC(disable_bypass, "Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU.");

+static bool disable_msipolling;
+module_param(disable_msipolling, bool, 0444);
+MODULE_PARM_DESC(disable_msipolling,
+	"Disable MSI-based polling for CMD_SYNC completion.");
+
 enum pri_resp {
 	PRI_RESP_DENY = 0,
 	PRI_RESP_FAIL = 1,
@@ -652,6 +657,7 @@ struct arm_smmu_device {
 #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
 #define ARM_SMMU_OPT_PAGE0_REGS_ONLY	(1 << 1)
+#define ARM_SMMU_OPT_MSIPOLL		(1 << 2)
 	u32			options;

 	struct arm_smmu_cmdq	cmdq;
@@ -992,8 +998,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
 	 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
 	 * payload, so the write will zero the entire command on that platform.
 	 */
-	if (smmu->features & ARM_SMMU_FEAT_MSI &&
-	    smmu->features & ARM_SMMU_FEAT_COHERENCY) {
+	if (smmu->options & ARM_SMMU_OPT_MSIPOLL) {
 		ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
 				   q->ent_dwords * 8;
 	}
@@ -1332,8 +1337,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
 static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
 					 struct arm_smmu_ll_queue *llq)
 {
-	if (smmu->features & ARM_SMMU_FEAT_MSI &&
-	    smmu->features & ARM_SMMU_FEAT_COHERENCY)
+	if (smmu->options & ARM_SMMU_OPT_MSIPOLL)
 		return __arm_smmu_cmdq_poll_until_msi(smmu, llq);

 	return __arm_smmu_cmdq_poll_until_consumed(smmu, llq);
@@ -3741,8 +3745,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 	if (reg & IDR0_SEV)
 		smmu->features |= ARM_SMMU_FEAT_SEV;

-	if (reg & IDR0_MSI)
+	if (reg & IDR0_MSI) {
 		smmu->features |= ARM_SMMU_FEAT_MSI;
+		if (coherent && !disable_msipolling)
+			smmu->options |= ARM_SMMU_OPT_MSIPOLL;
+	}

 	if (reg & IDR0_HYP)
 		smmu->features |= ARM_SMMU_FEAT_HYP;
--
2.27.0
[PATCH v4 0/3] iommu/arm-smmu-v3: permit users to disable msi polling
patch 1/3 and patch 2/3 are the preparation of patch 3/3 which permits
users to disable MSI-based polling by cmd line.

-v4: with respect to Robin's comments
 * cleanup the code of the existing module parameter disable_bypass
 * add ARM_SMMU_OPT_MSIPOLL flag. on the other hand, we only need to
   check a bit in options rather than two bits in features

Barry Song (3):
  iommu/arm-smmu-v3: replace symbolic permissions by octal permissions
    for module parameter
  iommu/arm-smmu-v3: replace module_param_named by module_param for
    disable_bypass
  iommu/arm-smmu-v3: permit users to disable msi polling

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

--
2.27.0
[PATCH v4 2/3] iommu/arm-smmu-v3: replace module_param_named by module_param for disable_bypass
Just use module_param() - going out of the way to specify a "different"
name that's identical to the variable name is silly.

Signed-off-by: Barry Song
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index eea5f7c6d9ab..5b40d535a7c8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -414,7 +414,7 @@
 #define MSI_IOVA_LENGTH		0x10

 static bool disable_bypass = 1;
-module_param_named(disable_bypass, disable_bypass, bool, 0444);
+module_param(disable_bypass, bool, 0444);
 MODULE_PARM_DESC(disable_bypass, "Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU.");
--
2.27.0
[PATCH v4] iommu/arm-smmu-v3: permit users to disable msi polling
Polling by MSI isn't necessarily faster than polling by SEV. Tests on
hi1620 show hns3 100G NIC network throughput can improve from 25G to
27G if we disable MSI polling while running 16 netperf threads sending
UDP packets in size 32KB. TX throughput can improve from 7G to 7.7G for
single thread.

The reason for the throughput improvement is that the latency to poll
the completion of CMD_SYNC becomes smaller. After sending a CMD_SYNC in
an empty cmd queue, typically we need to wait for 280ns using MSI
polling. But we only need around 190ns after disabling MSI polling.

This patch provides a command line option so that users can decide to
use MSI polling or not based on their tests.

Signed-off-by: Barry Song
---
-v4: rebase on top of 5.9-rc1
     refine changelog

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ++++++++++++------
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207be7ea..89d3cb391fef 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -418,6 +418,11 @@ module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);
 MODULE_PARM_DESC(disable_bypass, "Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU.");

+static bool disable_msipolling;
+module_param_named(disable_msipolling, disable_msipolling, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_msipolling,
+	"Disable MSI-based polling for CMD_SYNC completion.");
+
 enum pri_resp {
 	PRI_RESP_DENY = 0,
 	PRI_RESP_FAIL = 1,
@@ -980,6 +985,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
 	return 0;
 }

+static bool arm_smmu_use_msipolling(struct arm_smmu_device *smmu)
+{
+	return !disable_msipolling &&
+		smmu->features & ARM_SMMU_FEAT_COHERENCY &&
+		smmu->features & ARM_SMMU_FEAT_MSI;
+}
+
 static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
 					 u32 prod)
 {
@@ -992,8 +1004,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
 	 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
 	 * payload, so the write will zero the entire command on that platform.
 	 */
-	if (smmu->features & ARM_SMMU_FEAT_MSI &&
-	    smmu->features & ARM_SMMU_FEAT_COHERENCY) {
+	if (arm_smmu_use_msipolling(smmu)) {
 		ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
 				   q->ent_dwords * 8;
 	}
@@ -1332,8 +1343,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
 static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
 					 struct arm_smmu_ll_queue *llq)
 {
-	if (smmu->features & ARM_SMMU_FEAT_MSI &&
-	    smmu->features & ARM_SMMU_FEAT_COHERENCY)
+	if (arm_smmu_use_msipolling(smmu))
 		return __arm_smmu_cmdq_poll_until_msi(smmu, llq);

 	return __arm_smmu_cmdq_poll_until_consumed(smmu, llq);
--
2.27.0
RE: [PATCH v3] iommu/arm-smmu-v3: permit users to disable MSI polling
> -----Original Message-----
> From: John Garry
> Sent: Tuesday, August 4, 2020 3:34 AM
> To: Song Bao Hua (Barry Song) ; w...@kernel.org;
> robin.mur...@arm.com; j...@8bytes.org; iommu@lists.linux-foundation.org
> Cc: Zengtao (B) ;
> linux-arm-ker...@lists.infradead.org
> Subject: Re: [PATCH v3] iommu/arm-smmu-v3: permit users to disable MSI
> polling
>
> On 01/08/2020 08:47, Barry Song wrote:
> > Polling by MSI isn't necessarily faster than polling by SEV. Tests on
> > hi1620 show hns3 100G NIC network throughput can improve from 25G to
> > 27G if we disable MSI polling while running 16 netperf threads sending
> > UDP packets in size 32KB.
>
> BTW, Do we have any more results than this? This is just one scenario.

John, it is more than a scenario. Micro-benchmark shows polling by SEV
has less latency than MSI. This motivated me to use a real scenario to
verify.

For this network case, if we set thread to 1 rather than 16, network TX
throughput can improve from 7Gbps to 7.7Gbps.

> How about your micro-benchmark, which allows you to set the number of
> CPUs?

The micro-benchmark is working like this:
 * Sending a CMD_SYNC in an empty command queue
 * Polling the completion of this CMD_SYNC by MSI or SEV

I have seen the polling latency can decrease by about 80ns. Without this
patch, the latency was about ~270ns; after this patch, it would be about
~190ns.

> Thanks,
> John
>
> > This patch provides a command line option so that users can decide to
> > use MSI polling or not based on their tests.
> >
> > Signed-off-by: Barry Song
> > ---
> > -v3:
> >  * rebase on top of linux-next as arm-smmu-v3.c has moved;
> >  * provide a command line option
> >
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ++++++++++++------
> >  1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 7196207be7ea..89d3cb391fef 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -418,6 +418,11 @@ module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);
> >  MODULE_PARM_DESC(disable_bypass,
> >  	"Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU.");
> >
> > +static bool disable_msipolling;
> > +module_param_named(disable_msipolling, disable_msipolling, bool, S_IRUGO);
> > +MODULE_PARM_DESC(disable_msipolling,
> > +	"Disable MSI-based polling for CMD_SYNC completion.");
> > +
> >  enum pri_resp {
> >  	PRI_RESP_DENY = 0,
> >  	PRI_RESP_FAIL = 1,
> > @@ -980,6 +985,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
> >  	return 0;
> >  }
> >
> > +static bool arm_smmu_use_msipolling(struct arm_smmu_device *smmu)
> > +{
> > +	return !disable_msipolling &&
> > +		smmu->features & ARM_SMMU_FEAT_COHERENCY &&
> > +		smmu->features & ARM_SMMU_FEAT_MSI;
> > +}
> > +
> >  static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
> >  					 u32 prod)
> >  {
> > @@ -992,8 +1004,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
> >  	 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
> >  	 * payload, so the write will zero the entire command on that platform.
> >  	 */
> > -	if (smmu->features & ARM_SMMU_FEAT_MSI &&
> > -	    smmu->features & ARM_SMMU_FEAT_COHERENCY) {
> > +	if (arm_smmu_use_msipolling(smmu)) {
> >  		ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
> >  				   q->ent_dwords * 8;
> >  	}
> > @@ -1332,8 +1343,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
> >  static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
> >  					 struct arm_smmu_ll_queue *llq)
> >  {
> > -	if (smmu->features & ARM_SMMU_FEAT_MSI &&
> > -	    smmu->features & ARM_SMMU_FEAT_COHERENCY)
> > +	if (arm_smmu_use_msipolling(smmu))
> >  		return __arm_smmu_cmdq_poll_until_msi(smmu, llq);
> >
> >  	return __arm_smmu_cmdq_poll_until_consumed(smmu, llq);

Thanks
Barry