Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-10-07 Thread Barry Song
On Fri, Oct 8, 2021 at 12:32 AM Jason Gunthorpe  wrote:
>
> On Thu, Oct 07, 2021 at 06:43:33PM +1300, Barry Song wrote:
>
> > So do we have a case where devices can directly access the kernel's data
> > structure such as a list/graph/tree with pointers to a kernel virtual 
> > address?
> > then devices don't need to translate the address of pointers in a structure.
> > I assume this is one of the most useful features userspace SVA can provide.
>
> AFAICT that is the only good case for KVA, but it is also completely
> against the endianness, word size and DMA portability design of the
> kernel.
>
> Going there requires some new set of portable APIs for globally
> coherent KVA DMA.

Yep, I agree. It would be very weird if accelerators/GPUs shared the
kernel's data structures but we still had to call dma_map_single/sg or
dma_sync_single_for_cpu/device etc. for each "DMA" operation that reads
or writes them. Once devices and CPUs share virtual addresses (SVA), it
seems the code should not need an explicit map/sync each time.

>
> Jason

Thanks
barry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-10-06 Thread Barry Song
On Tue, Oct 5, 2021 at 7:21 AM Jason Gunthorpe  wrote:
>
> On Mon, Oct 04, 2021 at 09:40:03AM -0700, Jacob Pan wrote:
> > Hi Barry,
> >
> > On Sat, 2 Oct 2021 01:45:59 +1300, Barry Song <21cn...@gmail.com> wrote:
> >
> > > >
> > > > > I assume KVA mode can avoid this iotlb flush as the device is using
> > > > > the page table of the kernel and sharing the whole kernel space. But
> > > > > will users be glad to accept this mode?
> > > >
> > > > You can avoid the lock by identity mapping the physical address space
> > > > of the kernel and making map/unmap a NOP.
> > > >
> > > > KVA is just a different way to achieve this identity map with slightly
> > > > different security properties than the normal way, but it doesn't
> > > > reach to the same security level as proper map/unmap.
> > > >
> > > > I'm not sure anyone who cares about DMA security would see value in
> > > > the slight difference between KVA and a normal identity map.
> > >
> > > Yes, this is an important question. If users want a high security level,
> > > KVA might not be their choice; if users don't want the security, they are
> > > using IOMMU passthrough. So when will users choose KVA?
> > Right, KVAs sit in the middle in terms of performance and security.
> > Performance is better than IOVA due to IOTLB flush as you mentioned. Also
> > not too far behind of pass-through.
>
> The IOTLB flush is not on a DMA path but on a vmap path, so it is very
> hard to compare the two things.. Maybe vmap can be made to do lazy
> IOTLB flush or something and it could be closer
>
> > Security-wise, KVA respects kernel mapping. So permissions are better
> > enforced than pass-through and identity mapping.
>
> Is this meaningful? Isn't the entire physical map still in the KVA and
> isn't it entirely RW ?

Some areas are RX, for example, ARCH64 supports KERNEL_TEXT_RDONLY.
But the difference is really minor.

So do we have a case where devices can directly access a kernel data
structure, such as a list/graph/tree, whose pointers hold kernel virtual
addresses? Then devices would not need to translate the addresses of the
pointers in the structure. I assume this is one of the most useful features
userspace SVA can provide.

But do we have a case where accelerators/GPUs want to use the complex data
structures of kernel drivers?
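
To make the pointer-translation point concrete, here is a minimal userspace
sketch (not kernel code). It contrasts walking a linked list through its raw
virtual-address pointers, which is what a device sharing the address space
could do, with first flattening the list into an index-linked copy, which is
roughly what is needed when the device cannot interpret those pointers. All
names (struct node, struct flat_node, flatten_list and so on) are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A CPU-side list node whose link is a plain virtual address. */
struct node {
	int value;
	struct node *next;
};

/* Position-independent copy: the link is an index, not a pointer. */
struct flat_node {
	int value;
	int32_t next_idx;	/* -1 terminates the list */
};

/* With a shared address space the "device" can follow pointers as-is. */
static long sum_by_pointer(const struct node *head)
{
	long sum = 0;

	for (; head; head = head->next)
		sum += head->value;
	return sum;
}

/*
 * Without one, the CPU first rewrites every link into a form the device
 * understands. Returns the node count, or -1 if 'out' is too small.
 */
static int flatten_list(const struct node *head, struct flat_node *out,
			size_t cap)
{
	size_t n = 0;

	for (; head; head = head->next, n++) {
		if (n == cap)
			return -1;
		out[n].value = head->value;
		out[n].next_idx = head->next ? (int32_t)(n + 1) : -1;
	}
	return (int)n;
}

/* The "device" then walks the translated copy by index. */
static long sum_by_index(const struct flat_node *nodes, int count)
{
	long sum = 0;
	int32_t i = count > 0 ? 0 : -1;

	while (i >= 0) {
		sum += nodes[i].value;
		i = nodes[i].next_idx;
	}
	return sum;
}
```

The flattening pass is the per-structure cost that shared virtual addressing
would remove; whether any in-kernel user actually needs that is exactly the
open question above.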

>
> Jason

Thanks
barry


Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-10-01 Thread Barry Song
On Wed, Sep 22, 2021 at 5:14 PM Jacob Pan  wrote:
>
> Hi Joerg/Jason/Christoph et al.,
>
> The current in-kernel supervisor PASID support is based on the SVM/SVA
> machinery in sva-lib. Kernel SVA is achieved by extending a special flag
> to indicate the binding of the device and a page table should be performed
> on init_mm instead of the mm of the current process. Page requests and other
> differences between user and kernel SVA are handled as special cases.
>
> This unrestricted binding with the kernel page table is being challenged
> for security and the convention that in-kernel DMA must be compatible with
> DMA APIs.
> (https://lore.kernel.org/linux-iommu/20210511194726.gp1002...@nvidia.com/)
> There is also the lack of IOTLB synchronization upon kernel page table 
> updates.
>
> This patchset is trying to address these concerns by having an explicit DMA
> API compatible model while continuing to support in-kernel use of DMA requests
> with PASID. Specifically, the following DMA-IOMMU APIs are introduced:
>
> int iommu_dma_pasid_enable/disable(struct device *dev,
>                                    struct iommu_domain **domain,
>                                    enum iommu_dma_pasid_mode mode);
> int iommu_map/unmap_kva(struct iommu_domain *domain,
>                         void *cpu_addr, size_t size, int prot);
>
> The following three addressing modes are supported with example API usages
> by device drivers.
>
> 1. Physical address (bypass) mode. Similar to DMA direct where trusted devices
> can DMA pass through IOMMU on a per PASID basis.
> Example:
> pasid = iommu_dma_pasid_enable(dev, NULL, IOMMU_DMA_PASID_BYPASS);
> /* Use the returning PASID and PA for work submission */
>
> 2. IOVA mode. DMA API compatible. Map a supervisor PASID the same way as the
> PCI requester ID (RID)
> Example:
> pasid = iommu_dma_pasid_enable(dev, NULL, IOMMU_DMA_PASID_IOVA);
> /* Use the PASID and DMA API allocated IOVA for work submission */

Hi Jacob,
this might be a stupid question: what is the performance benefit of this IOVA
mode compared with the current dma_map/unmap_single/sg API on an IOMMU-enabled
driver such as drivers/iommu/arm/arm-smmu-v3? Do we still need to flush the
IOTLB by sending commands to the IOMMU on every dma_unmap?

>
> 3. KVA mode. New kva map/unmap APIs. Support fast and strict sub-modes
> transparently based on device trustfulness.
> Example:
> pasid = iommu_dma_pasid_enable(dev, , IOMMU_DMA_PASID_KVA);
> iommu_map_kva(domain, , size, prot);
> /* Use the returned PASID and KVA to submit work */
> Where:
> Fast mode: Shared CPU page tables for trusted devices only
> Strict mode: IOMMU domain returned for the untrusted device to
> replicate KVA-PA mapping in IOMMU page tables.

A huge bottleneck of the IOMMU we have seen before is that dma_unmap
requires an IOTLB flush. For example, in arm_smmu_cmdq_issue_cmdlist() we see
serious contention on acquiring the lock, and delay waiting for IOTLB flush
completion in arm_smmu_cmdq_poll_until_sync(), when multiple threads run.

I assume KVA mode can avoid this IOTLB flush as the device is using the page
table of the kernel and sharing the whole kernel space. But will users be
glad to accept this mode?
It seems users are enduring the performance decrease of IOVA mapping and
unmapping because it has better security: with the current dma-map/unmap on
an IOMMU backend, DMA operations can only run on the specific DMA buffers
which have been mapped.
Some drivers are using a bounce buffer to overcome the performance loss of
dma_map/unmap, as copying is faster than unmapping:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907676b130711fd1f

BTW, we have been debugging dma_map/unmap performance with this
benchmark:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/dma/map_benchmark.c
You might be able to use it for your benchmarking as well :-)

>
> On a per device basis, DMA address and performance modes are enabled by the
> device drivers. Platform information such as trustability, user command line
> input (not included in this set) could also be taken into consideration (not
> implemented in this RFC).
>
> This RFC is intended to communicate the API directions. Little testing is done
> outside IDXD and DMA engine tests.
>
> For PA and IOVA modes, the implementation is straightforward and tested with
> Intel IDXD driver. But several opens remain in KVA fast mode thus not tested:
> 1. Lack of IOTLB synchronization, kernel direct map alias can be updated as a
> result of module loading/eBPF load. Adding kernel mmu notifier?
> 2. The use of the auxiliary domain for KVA map, will aux domain stay in the
> long term? Is there another way to represent sub-device granu isolation?
> 3. Is limiting the KVA sharing to the direct map range reasonable and
> practical for all architectures?
>
>
> Many thanks to Ashok Raj, Kevin Tian, and 

Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-10-01 Thread Barry Song
On Sat, Oct 2, 2021 at 1:36 AM Jason Gunthorpe  wrote:
>
> On Sat, Oct 02, 2021 at 01:24:54AM +1300, Barry Song wrote:
>
> > I assume KVA mode can avoid this iotlb flush as the device is using
> > the page table of the kernel and sharing the whole kernel space. But
> > will users be glad to accept this mode?
>
> You can avoid the lock by identity mapping the physical address space
> of the kernel and making map/unmap a NOP.
>
> KVA is just a different way to achieve this identity map with slightly
> different security properties than the normal way, but it doesn't
> reach to the same security level as proper map/unmap.
>
> I'm not sure anyone who cares about DMA security would see value in
> the slight difference between KVA and a normal identity map.

Yes, this is an important question. If users want a high security level,
KVA might not be their choice; if users don't want the security, they are using
IOMMU passthrough. So when will users choose KVA?

>
> > which have been mapped in the current dma-map/unmap with IOMMU backend.
> > some drivers are using bouncing buffer to overcome the performance loss of
> > dma_map/unmap as copying is faster than unmapping:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907676b130711fd1f
>
> It is pretty unfortunate that drivers are hard coding behaviors based
> on assumptions of what the portable API is doing under the covers.

Not really, since the driver has a tx_copybreak which can be set by ethtool
or similar userspace tools. If users are using IOMMU passthrough, copying
won't happen with the default tx_copybreak. If users are using strict IOMMU
mode, socket buffers are copied into buffers that the driver has already
allocated and mapped, so this doesn't require mapping and unmapping socket
buffers frequently.
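
A rough userspace model of the copybreak decision described above, assuming a
single pre-mapped bounce buffer per ring. This is a sketch, not the driver's
actual code: struct tx_ring, tx_submit and BOUNCE_SIZE are hypothetical names,
and the counters stand in for the real copy and dma_map_single() paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 2048	/* hypothetical size of the pre-mapped buffer */

struct tx_ring {
	unsigned char bounce[BOUNCE_SIZE]; /* stands in for a buffer mapped once at init */
	size_t tx_copybreak;	/* 0 disables copying */
	unsigned long copies;	/* packets sent through the bounce buffer */
	unsigned long maps;	/* packets that needed a map/unmap pair */
};

/* Returns true when the packet was copied instead of mapped. */
static bool tx_submit(struct tx_ring *ring, const void *pkt, size_t len)
{
	if (ring->tx_copybreak && len <= ring->tx_copybreak &&
	    len <= sizeof(ring->bounce)) {
		memcpy(ring->bounce, pkt, len);
		ring->copies++;
		return true;
	}
	/* A real driver would map the buffer here and unmap it on completion. */
	ring->maps++;
	return false;
}
```

With tx_copybreak set to 0 every packet takes the map path, which matches the
passthrough case where mapping is cheap; raising it trades a memcpy for the
unmap-time IOTLB flush.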

>
> Jason

Thanks
barry


RE: [PATCH] dma-mapping: benchmark: use the correct HiSilicon copyright

2021-03-31 Thread Song Bao Hua (Barry Song)



> -----Original Message-----
> From: fanghao (A)
> Sent: Tuesday, March 30, 2021 7:34 PM
> To: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; Song Bao Hua
> (Barry Song) 
> Cc: iommu@lists.linux-foundation.org; linux...@openeuler.org;
> linux-kselft...@vger.kernel.org; fanghao (A) 
> Subject: [PATCH] dma-mapping: benchmark: use the correct HiSilicon copyright
> 
> s/Hisilicon/HiSilicon/g.
> It should use capital S, according to
> https://www.hisilicon.com/en/terms-of-use.
> 

My bad. Thanks.

Acked-by: Barry Song 

> Signed-off-by: Hao Fang 
> ---
>  kernel/dma/map_benchmark.c  | 2 +-
>  tools/testing/selftests/dma/dma_map_benchmark.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index e0e64f8..00d6549 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -1,6 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /*
> - * Copyright (C) 2020 Hisilicon Limited.
> + * Copyright (C) 2020 HiSilicon Limited.
>   */
> 
>  #define pr_fmt(fmt)  KBUILD_MODNAME ": " fmt
> diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c
> b/tools/testing/selftests/dma/dma_map_benchmark.c
> index fb23ce9..b492bed 100644
> --- a/tools/testing/selftests/dma/dma_map_benchmark.c
> +++ b/tools/testing/selftests/dma/dma_map_benchmark.c
> @@ -1,6 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /*
> - * Copyright (C) 2020 Hisilicon Limited.
> + * Copyright (C) 2020 HiSilicon Limited.
>   */
> 
>  #include 
> --
> 2.8.1



RE: [PATCH] dma-mapping: make map_benchmark compile into module

2021-03-24 Thread Song Bao Hua (Barry Song)



> -----Original Message-----
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Wednesday, March 24, 2021 8:13 PM
> To: tiantao (H) 
> Cc: a...@linux-foundation.org; pet...@infradead.org; paul...@kernel.org;
> a...@kernel.org; t...@linutronix.de; rost...@goodmis.org; h...@lst.de;
> m.szyprow...@samsung.com; Song Bao Hua (Barry Song)
> ; iommu@lists.linux-foundation.org;
> linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] dma-mapping: make map_benchmark compile into module
> 
> On Wed, Mar 24, 2021 at 10:17:38AM +0800, Tian Tao wrote:
> > under some scenarios, it is necessary to compile map_benchmark
> > into module to test iommu, so this patch changed Kconfig and
> > export_symbol to implement map_benchmark compiled into module.
> >
> > On the other hand, map_benchmark is a driver, which is supposed
> > to be able to run as a module.
> >
> > Signed-off-by: Tian Tao 
> 
> Nope, we're not going to export more kthread internals for a test
> module.

The requirement comes from a colleague who frequently changes the
map-bench code for customized test purposes, and he doesn't want to
build a kernel image and reboot every time. So I passed the
requirement on to Tao Tian.

Right now kthread_bind() is exported. kthread_bind_mask() seems
to be a little bit "internal", as you said; maybe a wrapper like
kthread_bind_node() wouldn't be that "internal", compared to exposing
the cpumask?
Anyway, we haven't found other driver users for this, so I can hardly
convince you it is worth it.

Thanks
Barry


RE: [PATCH] dma-mapping: make map_benchmark compile into module

2021-03-23 Thread Song Bao Hua (Barry Song)



> -----Original Message-----
> From: tiantao (H)
> Sent: Wednesday, March 24, 2021 3:18 PM
> To: a...@linux-foundation.org; pet...@infradead.org; paul...@kernel.org;
> a...@kernel.org; t...@linutronix.de; rost...@goodmis.org; h...@lst.de;
> m.szyprow...@samsung.com; Song Bao Hua (Barry Song)
> 
> Cc: iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org; tiantao
> (H) 
> Subject: [PATCH] dma-mapping: make map_benchmark compile into module
> 
> under some scenarios, it is necessary to compile map_benchmark
> into module to test iommu, so this patch changed Kconfig and
> export_symbol to implement map_benchmark compiled into module.
> 
> On the other hand, map_benchmark is a driver, which is supposed
> to be able to run as a module.
> 
> Signed-off-by: Tian Tao 
> ---

Acked-by: Barry Song 

Looks sensible to me. I like the idea that map_benchmark is
a driver. It seems unreasonable to always require it to be built-in.


>  kernel/dma/Kconfig | 2 +-
>  kernel/kthread.c   | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index 77b4055..0468293 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -223,7 +223,7 @@ config DMA_API_DEBUG_SG
> If unsure, say N.
> 
>  config DMA_MAP_BENCHMARK
> - bool "Enable benchmarking of streaming DMA mapping"
> + tristate "Enable benchmarking of streaming DMA mapping"
>   depends on DEBUG_FS
>   help
> Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 1578973..fa4736f 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -455,6 +455,7 @@ void kthread_bind_mask(struct task_struct *p, const struct cpumask *mask)
>  {
>   __kthread_bind_mask(p, mask, TASK_UNINTERRUPTIBLE);
>  }
> +EXPORT_SYMBOL(kthread_bind_mask);
> 
>  /**
>   * kthread_bind - bind a just-created kthread to a cpu.
> --
> 2.7.4

Thanks
Barry



RE: [PATCH] dma-mapping: benchmark: Add support for multi-pages map/unmap

2021-03-18 Thread Song Bao Hua (Barry Song)



> -----Original Message-----
> From: chenxiang (M)
> Sent: Thursday, March 18, 2021 10:30 PM
> To: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; Song Bao Hua
> (Barry Song) 
> Cc: iommu@lists.linux-foundation.org; linux...@openeuler.org;
> linux-kselft...@vger.kernel.org; chenxiang (M) 
> Subject: [PATCH] dma-mapping: benchmark: Add support for multi-pages map/unmap
> 
> From: Xiang Chen 
> 
> Currently the dma-map benchmark only supports mapping/unmapping one page
> at a time, but there are other scenarios which need multi-page
> map/unmap: for multi-page interfaces such as dma_alloc_coherent() and
> dma_map_sg(), the time spent on a multi-page map/unmap is not the time
> of a single page * npages (not linear), as it may use block descriptors
> instead of page descriptors when the size is satisfied, such as 2M/1G,
> and it can also send a single TLB invalidation command to invalidate
> multiple pages instead of multiple commands when RIL is enabled (which
> shortens the unmap time). So it is necessary to add support for
> multi-page map/unmap.
> 
> Add a parameter "-g" to support multi-pages map/unmap.
> 
> Signed-off-by: Xiang Chen 
> ---

Acked-by: Barry Song 

>  kernel/dma/map_benchmark.c  | 21 ++---
>  tools/testing/selftests/dma/dma_map_benchmark.c | 20 
>  2 files changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index e0e64f8..a5c1b01 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -38,7 +38,8 @@ struct map_benchmark {
>   __u32 dma_bits; /* DMA addressing capability */
>   __u32 dma_dir; /* DMA data direction */
>   __u32 dma_trans_ns; /* time for DMA transmission in ns */
> - __u8 expansion[80]; /* For future use */
> + __u32 granule;  /* how many PAGE_SIZE will do map/unmap once a time */
> + __u8 expansion[76]; /* For future use */
>  };
> 
>  struct map_benchmark_data {
> @@ -58,9 +59,11 @@ static int map_benchmark_thread(void *data)
>   void *buf;
>   dma_addr_t dma_addr;
>   struct map_benchmark_data *map = data;
> + int npages = map->bparam.granule;
> + u64 size = npages * PAGE_SIZE;
>   int ret = 0;
> 
> - buf = (void *)__get_free_page(GFP_KERNEL);
> + buf = alloc_pages_exact(size, GFP_KERNEL);
>   if (!buf)
>   return -ENOMEM;
> 
> @@ -76,10 +79,10 @@ static int map_benchmark_thread(void *data)
>* 66 means evertything goes well! 66 is lucky.
>*/
>   if (map->dir != DMA_FROM_DEVICE)
> - memset(buf, 0x66, PAGE_SIZE);
> + memset(buf, 0x66, size);
> 
>   map_stime = ktime_get();
> - dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, map->dir);
> + dma_addr = dma_map_single(map->dev, buf, size, map->dir);
>   if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
>   pr_err("dma_map_single failed on %s\n",
>   dev_name(map->dev));
> @@ -93,7 +96,7 @@ static int map_benchmark_thread(void *data)
>   ndelay(map->bparam.dma_trans_ns);
> 
>   unmap_stime = ktime_get();
> - dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);
> + dma_unmap_single(map->dev, dma_addr, size, map->dir);
>   unmap_etime = ktime_get();
>   unmap_delta = ktime_sub(unmap_etime, unmap_stime);
> 
> @@ -112,7 +115,7 @@ static int map_benchmark_thread(void *data)
>   }
> 
>  out:
> - free_page((unsigned long)buf);
> + free_pages_exact(buf, size);
>   return ret;
>  }
> 
> @@ -203,7 +206,6 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
>   struct map_benchmark_data *map = file->private_data;
>   void __user *argp = (void __user *)arg;
>   u64 old_dma_mask;
> -
>   int ret;
> 
>   if (copy_from_user(>bparam, argp, sizeof(map->bparam)))
> @@ -234,6 +236,11 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
>   return -EINVAL;
>   }
> 
> + if (map->bparam.granule < 1 || map->bparam.granule > 1024) {
> + pr_err("invalid granule size\n");
> + return -EINVAL;
> + }
> +
>   switch (map->bparam.dma_dir) {
>   case DMA_MAP_BIDIRECTIONAL:
>   map
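
Two details of the quoted patch can be checked in a small userspace sketch:
the new granule field is carved out of the expansion padding so the ioctl
structure keeps its size, and the granule is range-checked before being
turned into a byte count. The fields ahead of dma_bits are reproduced from
memory of the kernel source rather than from the diff, so treat them as an
assumption; the _old/_new struct names, granule_to_bytes and the 4 KiB
SKETCH_PAGE_SIZE are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Layout before the patch (leading fields assumed, see above). */
struct map_benchmark_old {
	uint64_t avg_map_100ns;
	uint64_t map_stddev;
	uint64_t avg_unmap_100ns;
	uint64_t unmap_stddev;
	uint32_t threads;
	uint32_t seconds;
	int32_t node;
	uint32_t dma_bits;
	uint32_t dma_dir;
	uint32_t dma_trans_ns;
	uint8_t expansion[80];
};

/* Layout after the patch: 4 bytes of padding become 'granule'. */
struct map_benchmark_new {
	uint64_t avg_map_100ns;
	uint64_t map_stddev;
	uint64_t avg_unmap_100ns;
	uint64_t unmap_stddev;
	uint32_t threads;
	uint32_t seconds;
	int32_t node;
	uint32_t dma_bits;
	uint32_t dma_dir;
	uint32_t dma_trans_ns;
	uint32_t granule;
	uint8_t expansion[76];
};

/* The user/kernel ABI only stays compatible if the size is unchanged. */
_Static_assert(sizeof(struct map_benchmark_old) ==
	       sizeof(struct map_benchmark_new),
	       "granule must be taken from the expansion area");

/* Mirrors the patch's range check; returns bytes per mapping, 0 if rejected. */
static uint64_t granule_to_bytes(uint32_t granule)
{
	if (granule < 1 || granule > 1024)
		return 0;
	return (uint64_t)granule * SKETCH_PAGE_SIZE;
}
```

Under the 4 KiB assumption the largest accepted granule maps 4 MiB per call,
which is enough to cross the 2M block size mentioned in the commit message.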

RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-10 Thread Song Bao Hua (Barry Song)



> -----Original Message-----
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Thursday, February 11, 2021 7:04 AM
> To: Song Bao Hua (Barry Song) 
> Cc: David Hildenbrand ; Wangzhou (B)
> ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org;
> eric.au...@redhat.com; Liguozhu (Kenneth) ;
> zhangfei@linaro.org; chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Tue, Feb 09, 2021 at 10:22:47PM +, Song Bao Hua (Barry Song) wrote:
> 
> > The problem is that SVA declares we can use any memory of a process
> > to do I/O. And in real scenarios, we are unable to customize most
> > applications to make them use the pool. So we are looking for some
> > extension generically for applications such as Nginx, Ceph.
> 
> But those applications will suffer jitter even if they are using the CPU
> to do the same work. I fail to see why adding an accelerator suddenly
> means the application owner will care about jitter introduced by
> migration/etc.

The only point here is that when migration occurs, the impact/jitter on the
accelerator is much bigger than it is on the CPU. The accelerator might then
be unhelpful.

> 
> Again in proper SVA it should be quite unlikely to take a fault caused
> by something like migration, on the same likelihood as the CPU. If
> things are faulting so much this is a problem then I think it is a
> system level problem with doing too much page motion.

My point is that a single SVA application shouldn't require the system
to make global changes, such as disabling NUMA balancing or disabling
THP, which reduce its page fault frequency at the expense of other
applications.

Anyway, people are away for the Lunar New Year. Hopefully we will get more
real benchmark data afterwards to make the discussion more targeted.

> 
> Jason

Thanks
Barry


RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-09 Thread Song Bao Hua (Barry Song)



> -----Original Message-----
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Wednesday, February 10, 2021 2:54 AM
> To: Song Bao Hua (Barry Song) 
> Cc: David Hildenbrand ; Wangzhou (B)
> ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org;
> eric.au...@redhat.com; Liguozhu (Kenneth) ;
> zhangfei@linaro.org; chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Tue, Feb 09, 2021 at 03:01:42AM +, Song Bao Hua (Barry Song) wrote:
> 
> > On the other hand, wouldn't it be the benefit of hardware accelerators
> > to have a lower and more stable latency zip/encryption than CPU?
> 
> No, I don't think so.

Fortunately or unfortunately, I think my people do have the target of
lower-latency and more stable zip/encryption by using accelerators;
otherwise they would use the CPU directly, as accelerators would have no
advantage.

> 
> If this is an important problem then it should apply equally to CPU
> and IO jitter.
> 
> Honestly I find the idea that occasional migration jitters CPU and DMA
> to not be very compelling. Such specialized applications should
> allocate special pages to avoid this, not adding an API to be able to
> lock down any page

That is exactly what we have done: we provide a HugeTLB pool so that
applications can allocate memory from it.

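 +-------------------------------------+
 |                                     |
 |   applications using accelerators   |
 +-------------------------------------+

     alloc from pool     free to pool
           ^                  +
           |                  |
           |                  |
           +                  v
 +---------+------------------+--------+
 |                                     |
 |         HugeTLB memory pool         |
 |                                     |
 +-------------------------------------+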
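
For readers unfamiliar with this kind of pool, the following is a minimal
userspace sketch of the idea, not the pool used by the people in this thread
(which the mail does not show). It assumes 2 MiB huge pages, tries an
anonymous MAP_HUGETLB mapping, and falls back to normal pages when no huge
pages are reserved. struct mem_pool, pool_init, pool_alloc and pool_destroy
are hypothetical names, and the bump allocator omits the per-chunk "free to
pool" path from the diagram to stay short.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0	/* not available: behave like a normal mapping */
#endif

#define HUGE_SZ (2UL << 20)	/* assumption: 2 MiB huge pages */

struct mem_pool {
	unsigned char *base;
	size_t len;	/* mapping length, a multiple of HUGE_SZ */
	size_t used;	/* bump-allocator offset */
};

/* Map the backing region; returns 0 on success, -1 on failure. */
static int pool_init(struct mem_pool *p, size_t bytes)
{
	size_t len = (bytes + HUGE_SZ - 1) & ~(HUGE_SZ - 1);
	void *m;

	if (!len)
		return -1;
	m = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (m == MAP_FAILED)	/* no huge pages reserved: fall back */
		m = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (m == MAP_FAILED)
		return -1;
	p->base = m;
	p->len = len;
	p->used = 0;
	return 0;
}

/* Hand out the next chunk, rounded up to 64 bytes; NULL when exhausted. */
static void *pool_alloc(struct mem_pool *p, size_t bytes)
{
	size_t need = (bytes + 63) & ~(size_t)63;
	void *ptr;

	if (!bytes || need < bytes || need > p->len - p->used)
		return NULL;
	ptr = p->base + p->used;
	p->used += need;
	return ptr;
}

/* Return the whole region to the system. */
static void pool_destroy(struct mem_pool *p)
{
	munmap(p->base, p->len);
	p->base = NULL;
}
```

An application has to be written against such an allocator to benefit from
it, and that is the limitation at issue here.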

The problem is that SVA declares we can use any memory of a process
to do I/O. And in real scenarios, we are unable to customize most
applications to make them use the pool. So we are looking for some
extension generically for applications such as Nginx, Ceph.

I am also thinking about leveraging vm.compact_unevictable_allowed,
which David suggested, and extending it, for example to
permit users to disable compaction and NUMA balancing on the unevictable
pages of an SVA process, which might be a smaller change.

> 
> Jason

Thanks
Barry



RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread Song Bao Hua (Barry Song)



> -----Original Message-----
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, February 9, 2021 10:30 AM
> To: Song Bao Hua (Barry Song) 
> Cc: David Hildenbrand ; Wangzhou (B)
> ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org;
> eric.au...@redhat.com; Liguozhu (Kenneth) ;
> zhangfei@linaro.org; chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Mon, Feb 08, 2021 at 08:35:31PM +, Song Bao Hua (Barry Song) wrote:
> >
> >
> > > From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> > > Sent: Tuesday, February 9, 2021 7:34 AM
> > > To: David Hildenbrand 
> > > Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> > > iommu@lists.linux-foundation.org; linux...@kvack.org;
> > > linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> > > Morton ; Alexander Viro
> ;
> > > gre...@linuxfoundation.org; Song Bao Hua (Barry Song)
> > > ; kevin.t...@intel.com;
> > > jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> > > ; zhangfei@linaro.org; chensihang (A)
> > > 
> > > Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide 
> > > memory
> > > pin
> > >
> > > On Mon, Feb 08, 2021 at 09:14:28AM +0100, David Hildenbrand wrote:
> > >
> > > > People are constantly struggling with the effects of long term pinnings
> > > > under user space control, like we already have with vfio and RDMA.
> > > >
> > > > And here we are, adding yet another, easier way to mess with core MM in
> > > > the same way. This feels like a step backwards to me.
> > >
> > > Yes, this seems like a very poor candidate to be a system call in this
> > > format. Much too narrow, poorly specified, and possibly security
> > > implications to allow any process whatsoever to pin memory.
> > >
> > > I keep encouraging people to explore a standard shared SVA interface
> > > that can cover all these topics (and no, uaccel is not that
> > > interface), that seems much more natural.
> > >
> > > I still haven't seen an explanation why DMA is so special here,
> > > migration and so forth jitter the CPU too, environments that care
> > > about jitter have to turn this stuff off.
> >
> > This paper has a good explanation:
> > https://ieeexplore.ieee.org/stamp/stamp.jsp?tp==7482091
> >
> > Mainly because a CPU page fault goes directly to the CPU, and we have
> > many CPUs. But I/O page faults take a different path, which means much
> > higher latency, 3-80x slower than a CPU page fault:
> > events in hardware queue -> interrupts -> CPU processing page fault
> > -> return events to IOMMU/device -> continue I/O.
> 
> The justifications for this was migration scenarios and migration is
> short. If you take a fault on what you are migrating only then does it
> slow down the CPU.

I agree this can slow down the CPU, but not as much as an IO page fault does.

On the other hand, isn't the benefit of hardware accelerators supposed
to be lower and more stable zip/encryption latency than the CPU?

> 
> Are you also working with HW where the IOMMU becomes invalidated after
> a migration and doesn't reload?
> 
> ie not true SVA but the sort of emulated SVA we see in a lot of
> places?

Yes. It is true SVA, not emulated SVA.

> 
> It would be much better to work on improving that to have closer sync with
> the CPU page table than to use pinning.

I absolutely agree that improving IOPF until it catches up with the
performance of CPU page faults is the best way, but it would take a
long time to optimize both HW and SW. While waiting for them to
mature, some way of minimizing IOPF should probably be used to
keep things responsive.

> 
> Jason

Thanks
Barry


RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread Song Bao Hua (Barry Song)



> -----Original Message-----
> From: David Hildenbrand [mailto:da...@redhat.com]
> Sent: Monday, February 8, 2021 11:37 PM
> To: Song Bao Hua (Barry Song) ; Matthew Wilcox
> 
> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On 08.02.21 11:13, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf
> Of
> >> David Hildenbrand
> >> Sent: Monday, February 8, 2021 9:22 PM
> >> To: Song Bao Hua (Barry Song) ; Matthew Wilcox
> >> 
> >> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> >> iommu@lists.linux-foundation.org; linux...@kvack.org;
> >> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> >> Morton ; Alexander Viro
> ;
> >> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> >> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> >> ; zhangfei@linaro.org; chensihang (A)
> >> 
> >> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> >> pin
> >>
> >> On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:
> >>>
> >>>
> >>>> -Original Message-
> >>>> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On
> Behalf
> >> Of
> >>>> Matthew Wilcox
> >>>> Sent: Monday, February 8, 2021 2:31 PM
> >>>> To: Song Bao Hua (Barry Song) 
> >>>> Cc: Wangzhou (B) ;
> linux-ker...@vger.kernel.org;
> >>>> iommu@lists.linux-foundation.org; linux...@kvack.org;
> >>>> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> >>>> Morton ; Alexander Viro
> >> ;
> >>>> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> >>>> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> >>>> ; zhangfei@linaro.org; chensihang (A)
> >>>> 
> >>>> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide 
> >>>> memory
> >>>> pin
> >>>>
> >>>> On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) wrote:
> >>>>>>> In high-performance I/O cases, accelerators might want to perform
> >>>>>>> I/O on a memory without IO page faults which can result in
> >>>>>>> dramatically increased latency. Current memory related APIs could
> >>>>>>> not achieve this requirement, e.g. mlock can only avoid memory to
> >>>>>>> swap to backup device, page migration can still trigger IO page
> >>>>>>> fault.
> >>>>>>
> >>>>>> Well ... we have two requirements.  The application wants to not take
> >>>>>> page faults.  The system wants to move the application to a different
> >>>>>> NUMA node in order to optimise overall performance.  Why should the
> >>>>>> application's desires take precedence over the kernel's desires?  And
> why
> >>>>>> should it be done this way rather than by the sysadmin using numactl
> to
> >>>>>> lock the application to a particular node?
> >>>>>
> >>>>> NUMA balancer is just one of many reasons for page migration. Even one
> >>>>> simple alloc_pages() can cause memory migration in just single NUMA
> >>>>> node or UMA system.
> >>>>>
> >>>>> The other reasons for page migration include but are not limited to:
> >>>>> * memory move due to CMA
> >>>>> * memory move due to huge pages creation
> >>>>>
> >>>>> Hardly we can ask users to disable the COMPACTION, CMA and Huge Page
> >>>>> in the whole system.
> >>>>
> >>>> You're dodging the question.  Should the CMA allocation fail because
> >>>> another application is usin

RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, February 9, 2021 7:34 AM
> To: David Hildenbrand 
> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; Song Bao Hua (Barry Song)
> ; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Mon, Feb 08, 2021 at 09:14:28AM +0100, David Hildenbrand wrote:
> 
> > People are constantly struggling with the effects of long term pinnings
> > under user space control, like we already have with vfio and RDMA.
> >
> > And here we are, adding yet another, easier way to mess with core MM in the
> > same way. This feels like a step backwards to me.
> 
> Yes, this seems like a very poor candidate to be a system call in this
> format. Much too narrow, poorly specified, and possibly security
> implications to allow any process whatsoever to pin memory.
> 
> I keep encouraging people to explore a standard shared SVA interface
> that can cover all these topics (and no, uaccel is not that
> interface), that seems much more natural.
> 
> I still haven't seen an explanation why DMA is so special here,
> migration and so forth jitter the CPU too, environments that care
> about jitter have to turn this stuff off.

This paper has a good explanation:
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7482091

Mainly because a CPU page fault is taken directly on the faulting CPU,
and we have many CPUs. IO page faults take a different path, which
means much higher latency, 3-80x slower than a CPU page fault:
events in hardware queue -> interrupt -> CPU processes the page fault
-> return events to iommu/device -> continue I/O.

Copied from the paper:

If the IOMMU's page table walker fails to find the desired
translation in the page table, it sends an ATS response to
the GPU notifying it of this failure. This in turn corresponds
to a page fault. In response, the GPU sends another request to
the IOMMU called a Peripheral Page Request (PPR). The IOMMU
places this request in a memory-mapped queue and raises an
interrupt on the CPU. Multiple PPR requests can be queued
before the CPU is interrupted. The OS must have a suitable
IOMMU driver to process this interrupt and the queued PPR
requests. In Linux, while in an interrupt context, the driver
pulls PPR requests from the queue and places them in a work-queue
for later processing. Presumably this design decision was made
to minimize the time spent executing in an interrupt context,
where lower priority interrupts would be disabled. At a later
time, an OS worker-thread calls back into the driver to process
page fault requests in the work-queue. Once the requests are
serviced, the driver notifies the IOMMU. In turn, the IOMMU
notifies the GPU. The GPU then sends another ATS request to
retry the translation for the original faulting address.

Comparison with CPU: On the CPU, a hardware exception is
raised on a page fault, which immediately switches to the
OS. In most cases in Linux, this routine services the page
fault directly, instead of queuing it for later processing.
Contrast this with a page fault from an accelerator, where
the IOMMU has to interrupt the CPU to request service on
its behalf, and also note the several back-and-forth messages
between the accelerator, the IOMMU, and the CPU. Furthermore,
page faults on the CPU are generally handled one at a time
on the CPU, while for the GPU they are batched by the IOMMU
and OS work-queue mechanism.

> 
> Jason

Thanks
Barry

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of
> David Hildenbrand
> Sent: Monday, February 8, 2021 9:22 PM
> To: Song Bao Hua (Barry Song) ; Matthew Wilcox
> 
> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf
> Of
> >> Matthew Wilcox
> >> Sent: Monday, February 8, 2021 2:31 PM
> >> To: Song Bao Hua (Barry Song) 
> >> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> >> iommu@lists.linux-foundation.org; linux...@kvack.org;
> >> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> >> Morton ; Alexander Viro
> ;
> >> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> >> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> >> ; zhangfei@linaro.org; chensihang (A)
> >> 
> >> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> >> pin
> >>
> >> On Sun, Feb 07, 2021 at 10:24:28PM +0000, Song Bao Hua (Barry Song) wrote:
> >>>>> In high-performance I/O cases, accelerators might want to perform
> >>>>> I/O on a memory without IO page faults which can result in dramatically
> >>>>> increased latency. Current memory related APIs could not achieve this
> >>>>> requirement, e.g. mlock can only avoid memory to swap to backup device,
> >>>>> page migration can still trigger IO page fault.
> >>>>
> >>>> Well ... we have two requirements.  The application wants to not take
> >>>> page faults.  The system wants to move the application to a different
> >>>> NUMA node in order to optimise overall performance.  Why should the
> >>>> application's desires take precedence over the kernel's desires?  And why
> >>>> should it be done this way rather than by the sysadmin using numactl to
> >>>> lock the application to a particular node?
> >>>
> >>> NUMA balancer is just one of many reasons for page migration. Even one
> >>> simple alloc_pages() can cause memory migration in just single NUMA
> >>> node or UMA system.
> >>>
> >>> The other reasons for page migration include but are not limited to:
> >>> * memory move due to CMA
> >>> * memory move due to huge pages creation
> >>>
> >>> Hardly we can ask users to disable the COMPACTION, CMA and Huge Page
> >>> in the whole system.
> >>
> >> You're dodging the question.  Should the CMA allocation fail because
> >> another application is using SVA?
> >>
> >> I would say no.
> >
> > I would say no as well.
> >
> > While IOMMU is enabled, CMA almost has one user only: IOMMU driver
> > as other drivers will depend on iommu to use non-contiguous memory
> > though they are still calling dma_alloc_coherent().
> >
> > In iommu driver, dma_alloc_coherent is called during initialization
> > and there is no new allocation afterwards. So it wouldn't cause
> > runtime impact on SVA performance. Even there is new allocations,
> > CMA will fall back to general alloc_pages() and iommu drivers are
> > almost allocating small memory for command queues.
> >
> > So I would say general compound pages, huge pages, especially
> > transparent huge pages, would be bigger concerns than CMA for
> > internal page migration within one NUMA.
> >
> > Not like CMA, general alloc_pages() can get memory by moving
> > pages other than those pinned.
> >
> > And there is no guarantee we can always bind the memory of
> > SVA applications to single one NUMA, so NUMA balancing is
> > still a concern.
> >
> > But I agree we need a way to make CMA success while the userspace
> > pages are pinned. Since pin has been viral in many drivers, I
> > assume there is a way to handle this. Otherwise, APIs like
> > V4L2_MEMORY_USERPTR[1] will

RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-07 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: David Rientjes [mailto:rient...@google.com]
> Sent: Monday, February 8, 2021 3:18 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Matthew Wilcox ; Wangzhou (B)
> ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Sun, 7 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > NUMA balancer is just one of many reasons for page migration. Even one
> > simple alloc_pages() can cause memory migration in just single NUMA
> > node or UMA system.
> >
> > The other reasons for page migration include but are not limited to:
> > * memory move due to CMA
> > * memory move due to huge pages creation
> >
> > Hardly we can ask users to disable the COMPACTION, CMA and Huge Page
> > in the whole system.
> >
> 
> What about only for mlocked memory, i.e. disable
> vm.compact_unevictable_allowed?
> 
> Adding syscalls is a big deal, we can make a reasonable inference that
> we'll have to support this forever if it's merged.  I haven't seen mention
> of what other unevictable memory *should* be migratable that would be
> adversely affected if we disable that sysctl.  Maybe that gets you part of
> the way there and there are some other deficiencies, but it seems like a
> good start would be to describe how CONFIG_NUMA_BALANCING=n +
> vm.compact_unevictable_allowed + mlock() doesn't get you mostly there and
> then look into what's missing.
> 

I believe it can resolve the performance problem for SVA
applications if we disable vm.compact_unevictable_allowed and
NUMA balancing, and use mlock().

The problem is that it is unreasonable to ask users to disable
compact_unevictable_allowed or NUMA balancing for the whole system
only because there is one SVA application in the system.

SVA, in itself, is a mechanism to let the CPU and devices share the same
address space. In a typical server system there are many processes, so
the better way would be to change the behavior of only the specific
process rather than the whole system. It is hard to ask
users to do that only because there is an SVA monster.
Plus, this might negatively affect applications not using SVA.

> If it's a very compelling case where there simply are no alternatives, it
> would make sense.  Alternative is to find a more generic way, perhaps in
> combination with vm.compact_unevictable_allowed, to achieve what you're
> looking to do that can be useful even beyond your originally intended use
> case.

Sensible. Actually, pin is exactly the way to disable migration for specific
pages, i.e. the equivalent of disabling "vm.compact_unevictable_allowed"
for those pages only.

It is hard for the kernel to tell which pages should not be migrated. Only
apps know that, as even SVA applications can allocate many non-I/O pages
which should remain movable.
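
To make the mlock() part of this discussion concrete, here is a minimal
userspace sketch, assuming a Linux/POSIX system. The helper names
alloc_locked_io_buffer() and free_locked_io_buffer() are hypothetical and
are not part of any patch in this thread. It only illustrates that mlock()
is already a per-range, per-process knob; as noted earlier in the thread,
mlock() keeps pages resident but does not by itself stop page migration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate a page-aligned buffer intended for device
 * I/O and lock it into RAM. Returns NULL on failure with errno set.
 */
void *alloc_locked_io_buffer(size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	void *buf = NULL;
	size_t rounded;
	int err;

	if (page <= 0 || len == 0) {
		errno = EINVAL;
		return NULL;
	}

	/* Round the length up to a whole number of pages. */
	rounded = (len + (size_t)page - 1) / (size_t)page * (size_t)page;

	err = posix_memalign(&buf, (size_t)page, rounded);
	if (err != 0) {
		errno = err;
		return NULL;
	}

	/* Touch the pages so they are populated before locking. */
	memset(buf, 0, rounded);

	/* Only this range is locked; other allocations stay swappable. */
	if (mlock(buf, rounded) != 0) {
		int saved = errno;

		free(buf);
		errno = saved;
		return NULL;
	}
	return buf;
}

/* Hypothetical helper: unlock and release a buffer from the helper above. */
void free_locked_io_buffer(void *buf, size_t len)
{
	assert(buf != NULL);
	munlock(buf, len);
	free(buf);
}
```

An application would call alloc_locked_io_buffer() only for the buffers it
hands to the device and keep using plain malloc() for everything else, which
is the per-application granularity argued for above.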

Thanks
Barry


RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-07 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of
> Matthew Wilcox
> Sent: Monday, February 8, 2021 2:31 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Sun, Feb 07, 2021 at 10:24:28PM +0000, Song Bao Hua (Barry Song) wrote:
> > > > In high-performance I/O cases, accelerators might want to perform
> > > > I/O on a memory without IO page faults which can result in dramatically
> > > > increased latency. Current memory related APIs could not achieve this
> > > > requirement, e.g. mlock can only avoid memory to swap to backup device,
> > > > page migration can still trigger IO page fault.
> > >
> > > Well ... we have two requirements.  The application wants to not take
> > > page faults.  The system wants to move the application to a different
> > > NUMA node in order to optimise overall performance.  Why should the
> > > application's desires take precedence over the kernel's desires?  And why
> > > should it be done this way rather than by the sysadmin using numactl to
> > > lock the application to a particular node?
> >
> > NUMA balancer is just one of many reasons for page migration. Even one
> > simple alloc_pages() can cause memory migration in just single NUMA
> > node or UMA system.
> >
> > The other reasons for page migration include but are not limited to:
> > * memory move due to CMA
> > * memory move due to huge pages creation
> >
> > Hardly we can ask users to disable the COMPACTION, CMA and Huge Page
> > in the whole system.
> 
> You're dodging the question.  Should the CMA allocation fail because
> another application is using SVA?
> 
> I would say no.  

I would say no as well.

While the IOMMU is enabled, CMA has almost only one user, the IOMMU
driver, as other drivers rely on the IOMMU to use non-contiguous
memory even though they still call dma_alloc_coherent().

In the IOMMU driver, dma_alloc_coherent() is called during initialization
and there are no new allocations afterwards, so it wouldn't have a
runtime impact on SVA performance. Even if there are new allocations,
CMA will fall back to general alloc_pages(), and IOMMU drivers
mostly allocate small amounts of memory for command queues.

So I would say general compound pages, huge pages, and especially
transparent huge pages would be bigger concerns than CMA for
internal page migration within one NUMA node.

Unlike CMA, general alloc_pages() can get memory by moving
pages other than those pinned.

And there is no guarantee we can always bind the memory of
SVA applications to a single NUMA node, so NUMA balancing is
still a concern.

But I agree we need a way to make CMA succeed while userspace
pages are pinned. Since pinning has gone viral in many drivers, I
assume there is a way to handle this. Otherwise, APIs like
V4L2_MEMORY_USERPTR[1] will possibly make CMA fail, as there
is no guarantee that userspace will allocate unmovable memory
and there is no guarantee the fallback path, alloc_pages(), can
succeed when allocating big memory.

Will investigate more.

> The application using SVA should take the one-time
> performance hit from having its memory moved around.

Sometimes I also feel SVA is doomed to suffer from performance
impact due to page migration. But we are still trying to
extend its use cases to high-performance I/O.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/media/v4l2-core/videobuf-dma-sg.c

Thanks
Barry


RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-07 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Matthew Wilcox [mailto:wi...@infradead.org]
> Sent: Monday, February 8, 2021 10:34 AM
> To: Wangzhou (B) 
> Cc: linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> linux...@kvack.org; linux-arm-ker...@lists.infradead.org;
> linux-...@vger.kernel.org; Andrew Morton ;
> Alexander Viro ; gre...@linuxfoundation.org; Song
> Bao Hua (Barry Song) ; j...@ziepe.ca;
> kevin.t...@intel.com; jean-phili...@linaro.org; eric.au...@redhat.com;
> Liguozhu (Kenneth) ; zhangfei@linaro.org;
> chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Sun, Feb 07, 2021 at 04:18:03PM +0800, Zhou Wang wrote:
> > SVA(share virtual address) offers a way for device to share process virtual
> > address space safely, which makes more convenient for user space device
> > driver coding. However, IO page faults may happen when doing DMA
> > operations. As the latency of IO page fault is relatively big, DMA
> > performance will be affected severely when there are IO page faults.
> > From a long-term view, DMA performance will not be stable.
> >
> > In high-performance I/O cases, accelerators might want to perform
> > I/O on a memory without IO page faults which can result in dramatically
> > increased latency. Current memory related APIs could not achieve this
> > requirement, e.g. mlock can only avoid memory to swap to backup device,
> > page migration can still trigger IO page fault.
> 
> Well ... we have two requirements.  The application wants to not take
> page faults.  The system wants to move the application to a different
> NUMA node in order to optimise overall performance.  Why should the
> application's desires take precedence over the kernel's desires?  And why
> should it be done this way rather than by the sysadmin using numactl to
> lock the application to a particular node?

The NUMA balancer is just one of many reasons for page migration. Even one
simple alloc_pages() can cause memory migration within a single NUMA
node or on a UMA system.

The other reasons for page migration include but are not limited to:
* memory move due to CMA
* memory move due to huge pages creation

We can hardly ask users to disable compaction, CMA and huge pages
for the whole system.

On the other hand, numactl doesn't always bind memory to a single NUMA
node; sometimes, when an application needs many CPUs, it may be bound
to more than one memory node.

Thanks
Barry



[PATCH v3 2/2] dma-mapping: benchmark: pretend DMA is transmitting

2021-02-05 Thread Barry Song
In a real dma mapping use case, after dma_map is done, data will be
transmitted. Thus, in a multi-threaded scenario, IOMMU contention
should not be that severe. For example, if users enable multiple
threads to send network packets through 1G/10G/100Gbps NIC, usually
the steps will be: map -> transmission -> unmap.  Transmission delay
reduces the contention of IOMMU.

Here a delay is added to simulate the transmission between map and unmap
so that the tested result could be more accurate for TX and simple RX.
A typical TX transmission for NIC would be like: map -> TX -> unmap
since the socket buffers come from OS. Simple RX model eg. disk driver,
is also map -> RX -> unmap, but real RX model in a NIC could be more
complicated considering packets can come spontaneously and many drivers
are using pre-mapped buffers pool. This is in the TBD list.

Signed-off-by: Barry Song 
---
 kernel/dma/map_benchmark.c| 12 ++-
 .../testing/selftests/dma/dma_map_benchmark.c | 21 ---
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index da95df381483..e0e64f8b0739 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -21,6 +21,7 @@
 #define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
 #define DMA_MAP_MAX_THREADS1024
 #define DMA_MAP_MAX_SECONDS300
+#define DMA_MAP_MAX_TRANS_DELAY(10 * NSEC_PER_MSEC)
 
 #define DMA_MAP_BIDIRECTIONAL  0
 #define DMA_MAP_TO_DEVICE  1
@@ -36,7 +37,8 @@ struct map_benchmark {
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
-   __u8 expansion[84]; /* For future use */
+   __u32 dma_trans_ns; /* time for DMA transmission in ns */
+   __u8 expansion[80]; /* For future use */
 };
 
 struct map_benchmark_data {
@@ -87,6 +89,9 @@ static int map_benchmark_thread(void *data)
map_etime = ktime_get();
map_delta = ktime_sub(map_etime, map_stime);
 
+   /* Pretend DMA is transmitting */
+   ndelay(map->bparam.dma_trans_ns);
+
unmap_stime = ktime_get();
dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);
unmap_etime = ktime_get();
@@ -218,6 +223,11 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
return -EINVAL;
}
 
+   if (map->bparam.dma_trans_ns > DMA_MAP_MAX_TRANS_DELAY) {
+   pr_err("invalid transmission delay\n");
+   return -EINVAL;
+   }
+
if (map->bparam.node != NUMA_NO_NODE &&
!node_possible(map->bparam.node)) {
pr_err("invalid numa node\n");
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
index 537d65968c48..fb23ce9617ea 100644
--- a/tools/testing/selftests/dma/dma_map_benchmark.c
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -12,9 +12,12 @@
 #include 
 #include 
 
+#define NSEC_PER_MSEC  1000000L
+
 #define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
 #define DMA_MAP_MAX_THREADS1024
 #define DMA_MAP_MAX_SECONDS 300
+#define DMA_MAP_MAX_TRANS_DELAY(10 * NSEC_PER_MSEC)
 
 #define DMA_MAP_BIDIRECTIONAL  0
 #define DMA_MAP_TO_DEVICE  1
@@ -36,7 +39,8 @@ struct map_benchmark {
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
-   __u8 expansion[84]; /* For future use */
+   __u32 dma_trans_ns; /* time for DMA transmission in ns */
+   __u8 expansion[80]; /* For future use */
 };
 
 int main(int argc, char **argv)
@@ -46,12 +50,12 @@ int main(int argc, char **argv)
/* default single thread, run 20 seconds on NUMA_NO_NODE */
int threads = 1, seconds = 20, node = -1;
/* default dma mask 32bit, bidirectional DMA */
-   int bits = 32, dir = DMA_MAP_BIDIRECTIONAL;
+   int bits = 32, xdelay = 0, dir = DMA_MAP_BIDIRECTIONAL;
 
int cmd = DMA_MAP_BENCHMARK;
char *p;
 
-   while ((opt = getopt(argc, argv, "t:s:n:b:d:")) != -1) {
+   while ((opt = getopt(argc, argv, "t:s:n:b:d:x:")) != -1) {
switch (opt) {
case 't':
threads = atoi(optarg);
@@ -68,6 +72,9 @@ int main(int argc, char **argv)
case 'd':
dir = atoi(optarg);
break;
+   case 'x':
+   xdelay = atoi(optarg);
+   break;
default:
return -1;

[PATCH v3 1/2] dma-mapping: benchmark: use u8 for reserved field in uAPI structure

2021-02-05 Thread Barry Song
The original code put five u32 fields before a u64 expansion[10] array.
Five is odd, which leaves an alignment gap before the u64 array on 64-bit
systems and causes trouble when extending the structure with new
features. This patch switches the reserved field to u8 to avoid
future alignment risk.
Meanwhile, it also clears the memory of struct map_benchmark in tools;
otherwise, if users run an old version of the tool on a newer kernel, the
random expansion values will cause side effects on the newer kernel.

Signed-off-by: Barry Song 
---
 kernel/dma/map_benchmark.c  | 2 +-
 tools/testing/selftests/dma/dma_map_benchmark.c | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index 1b1b8ff875cb..da95df381483 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -36,7 +36,7 @@ struct map_benchmark {
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
-   __u64 expansion[10];/* For future use */
+   __u8 expansion[84]; /* For future use */
 };
 
 struct map_benchmark_data {
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
index 7065163a8388..537d65968c48 100644
--- a/tools/testing/selftests/dma/dma_map_benchmark.c
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -35,7 +36,7 @@ struct map_benchmark {
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
-   __u64 expansion[10];/* For future use */
+   __u8 expansion[84]; /* For future use */
 };
 
 int main(int argc, char **argv)
@@ -102,6 +103,7 @@ int main(int argc, char **argv)
exit(1);
}
 
+   memset(&map, 0, sizeof(map));
map.seconds = seconds;
map.threads = threads;
map.node = node;
-- 
2.25.1
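
The reason for clearing the structure in the tool can be sketched in
isolation. This is a minimal illustration, not the selftest itself: the
struct below mirrors the layout from the patch using <stdint.h> types, and
map_benchmark_prepare() and expansion_is_zero() are hypothetical helper
names. The point is that every reserved byte reaches the kernel as zero, so
a newer kernel that assigns meaning to part of expansion[] sees "feature
not requested" rather than stack garbage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the uAPI layout after this patch (assumed types). */
struct map_benchmark_sketch {
	uint64_t avg_map_100ns;   /* average map latency in 100ns */
	uint64_t map_stddev;      /* standard deviation of map latency */
	uint64_t avg_unmap_100ns; /* as above */
	uint64_t unmap_stddev;
	uint32_t threads;         /* how many threads map/unmap in parallel */
	uint32_t seconds;         /* how long the test will last */
	int32_t node;             /* which numa node the benchmark runs on */
	uint32_t dma_bits;        /* DMA addressing capability */
	uint32_t dma_dir;         /* DMA data direction */
	uint8_t expansion[84];    /* For future use */
};

/* Hypothetical helper: zero everything first, then fill the known fields. */
void map_benchmark_prepare(struct map_benchmark_sketch *map, uint32_t threads,
			   uint32_t seconds, int32_t node)
{
	assert(map != NULL);
	memset(map, 0, sizeof(*map));
	map->threads = threads;
	map->seconds = seconds;
	map->node = node;
}

/* Hypothetical helper: returns 1 if all reserved bytes are zero, else 0. */
int expansion_is_zero(const struct map_benchmark_sketch *map)
{
	size_t i;

	for (i = 0; i < sizeof(map->expansion); i++)
		if (map->expansion[i] != 0)
			return 0;
	return 1;
}
```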



RE: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting

2021-02-05 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Friday, February 5, 2021 11:36 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Christoph Hellwig ; m.szyprow...@samsung.com;
> robin.mur...@arm.com; iommu@lists.linux-foundation.org;
> linux-ker...@vger.kernel.org; linux...@openeuler.org
> Subject: Re: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting
> 
> On Fri, Feb 05, 2021 at 10:32:26AM +0000, Song Bao Hua (Barry Song) wrote:
> > I can keep the struct size unchanged by changing the struct to
> >
> > struct map_benchmark {
> > __u64 avg_map_100ns; /* average map latency in 100ns */
> > __u64 map_stddev; /* standard deviation of map latency */
> > __u64 avg_unmap_100ns; /* as above */
> > __u64 unmap_stddev;
> > __u32 threads; /* how many threads will do map/unmap in parallel */
> > __u32 seconds; /* how long the test will last */
> > __s32 node; /* which numa node this benchmark will run on */
> > __u32 dma_bits; /* DMA addressing capability */
> > __u32 dma_dir; /* DMA data direction */
> > __u32 dma_trans_ns; /* time for DMA transmission in ns */
> >
> > __u32 exp; /* For future use */
> > __u64 expansion[9]; /* For future use */
> > };
> >
> > But the code is really ugly now.
> 
> That's why we usually use __u8 fields for reserved fields.  You might
> consider just switching to that instead while you're at it. I guess
> we'll just have to get the addition into 5.11 then to make sure we
> don't release a kernel with the alignment fix.

I assume there is no need to keep the same size as in 5.11-rc, so I
could change the struct to:

struct map_benchmark {
__u64 avg_map_100ns; /* average map latency in 100ns */
__u64 map_stddev; /* standard deviation of map latency */
__u64 avg_unmap_100ns; /* as above */
__u64 unmap_stddev;
__u32 threads; /* how many threads will do map/unmap in parallel */
__u32 seconds; /* how long the test will last */
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
__u8 expansion[84]; /* For future use */
};

This won't increase the size on 64-bit systems, but it adds 4 bytes
on 32-bit systems compared to 5.11-rc. What do you think about it?
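
The size claim can be checked with sizeof/offsetof. The sketch below is
only an illustration: it mirrors the two layouts with <stdint.h> types, and
the struct and function names (mb_u64_expansion, mb_u8_expansion,
mb_u64_size() and so on) are made up for this example. It assumes an ABI
where uint64_t is 8-byte aligned (e.g. x86-64 or arm64) for the 64-bit
numbers in the comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout in 5.11-rc: five 32-bit fields, then a u64 array. */
struct mb_u64_expansion {
	uint64_t avg_map_100ns;
	uint64_t map_stddev;
	uint64_t avg_unmap_100ns;
	uint64_t unmap_stddev;
	uint32_t threads;
	uint32_t seconds;
	int32_t node;
	uint32_t dma_bits;
	uint32_t dma_dir;
	uint64_t expansion[10]; /* 8-byte aligned ABIs pad 4 bytes before this */
};

/* Proposed layout: the reserved area becomes a byte array. */
struct mb_u8_expansion {
	uint64_t avg_map_100ns;
	uint64_t map_stddev;
	uint64_t avg_unmap_100ns;
	uint64_t unmap_stddev;
	uint32_t threads;
	uint32_t seconds;
	int32_t node;
	uint32_t dma_bits;
	uint32_t dma_dir;
	uint8_t expansion[84];
};

/* A byte array needs no alignment, so it starts right after dma_dir. */
static_assert(offsetof(struct mb_u8_expansion, expansion) == 52,
	      "u8 expansion should start at byte 52");

size_t mb_u64_size(void)
{
	return sizeof(struct mb_u64_expansion); /* 136 with 8-byte u64 alignment */
}

size_t mb_u8_size(void)
{
	return sizeof(struct mb_u8_expansion); /* 52 + 84 = 136 */
}

size_t mb_u64_expansion_offset(void)
{
	return offsetof(struct mb_u64_expansion, expansion);
}
```

With 4-byte alignment for u64 (as on 32-bit x86), the first struct has no
padding and is 132 bytes, which is where the extra 4 bytes mentioned above
come from.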

Thanks
Barry



RE: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting

2021-02-05 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Friday, February 5, 2021 10:21 PM
> To: Song Bao Hua (Barry Song) 
> Cc: m.szyprow...@samsung.com; h...@lst.de; robin.mur...@arm.com;
> iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting
> 
> On Fri, Feb 05, 2021 at 03:00:35PM +1300, Barry Song wrote:
> > +   __u32 dma_trans_ns; /* time for DMA transmission in ns */
> > __u64 expansion[10];/* For future use */
> 
> We need to keep the struct size, so the expansion field needs to
> shrink by the equivalent amount of data that is added in dma_trans_ns.

Unfortunately I didn't put a reserved u32 field after dma_dir
in the original patch.
There were five 32-bit fields before expansion[]:

struct map_benchmark {
__u64 avg_map_100ns; /* average map latency in 100ns */
__u64 map_stddev; /* standard deviation of map latency */
__u64 avg_unmap_100ns; /* as above */
__u64 unmap_stddev;
__u32 threads; /* how many threads will do map/unmap in parallel */
__u32 seconds; /* how long the test will last */
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
__u64 expansion[10];/* For future use */
};

My bad. That was really silly. I should have done the below from
the very beginning:
struct map_benchmark {
__u64 avg_map_100ns; /* average map latency in 100ns */
__u64 map_stddev; /* standard deviation of map latency */
__u64 avg_unmap_100ns; /* as above */
__u64 unmap_stddev;
__u32 threads; /* how many threads will do map/unmap in parallel */
__u32 seconds; /* how long the test will last */
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
__u32 rsv;
__u64 expansion[10];/* For future use */
};

So on a 64-bit system, this patch doesn't change the length of the struct,
as the newly added u32 just fills the gap between dma_dir and expansion.

On a 32-bit system, this patch increases the length by 4 bytes.
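
The gap can be measured directly. This is a small sketch under the
assumption that uint64_t is 8-byte aligned (as on x86-64 or arm64); the
struct and function names (mb_orig, mb_with_trans, mb_orig_hole() and so
on) are invented for the illustration and the fields use <stdint.h> types
in place of the kernel's __u32/__u64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Original layout: five u32 fields followed directly by the u64 array. */
struct mb_orig {
	uint64_t avg_map_100ns;
	uint64_t map_stddev;
	uint64_t avg_unmap_100ns;
	uint64_t unmap_stddev;
	uint32_t threads;
	uint32_t seconds;
	int32_t node;
	uint32_t dma_bits;
	uint32_t dma_dir;
	uint64_t expansion[10];
};

/* v2 layout: dma_trans_ns added after dma_dir, expansion[] untouched. */
struct mb_with_trans {
	uint64_t avg_map_100ns;
	uint64_t map_stddev;
	uint64_t avg_unmap_100ns;
	uint64_t unmap_stddev;
	uint32_t threads;
	uint32_t seconds;
	int32_t node;
	uint32_t dma_bits;
	uint32_t dma_dir;
	uint32_t dma_trans_ns;
	uint64_t expansion[10];
};

/* Four u64 plus five u32 fields: dma_dir always ends at byte 52. */
static_assert(offsetof(struct mb_orig, dma_dir) + sizeof(uint32_t) == 52,
	      "dma_dir should end at byte 52");

/* Padding bytes between the last u32 field and expansion[]. */
size_t mb_orig_hole(void)
{
	return offsetof(struct mb_orig, expansion) -
	       (offsetof(struct mb_orig, dma_dir) + sizeof(uint32_t));
}

size_t mb_with_trans_hole(void)
{
	return offsetof(struct mb_with_trans, expansion) -
	       (offsetof(struct mb_with_trans, dma_trans_ns) + sizeof(uint32_t));
}

size_t mb_orig_size(void)
{
	return sizeof(struct mb_orig);
}

size_t mb_with_trans_size(void)
{
	return sizeof(struct mb_with_trans);
}
```

Where u64 is 8-byte aligned, mb_orig_hole() is 4 and both sizes are equal;
where it is 4-byte aligned there is no hole to fill, so the new field grows
the struct by 4 bytes.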

I can keep the struct size unchanged by changing the struct to

struct map_benchmark {
__u64 avg_map_100ns; /* average map latency in 100ns */
__u64 map_stddev; /* standard deviation of map latency */
__u64 avg_unmap_100ns; /* as above */
__u64 unmap_stddev;
__u32 threads; /* how many threads will do map/unmap in parallel */
__u32 seconds; /* how long the test will last */
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
__u32 dma_trans_ns; /* time for DMA transmission in ns */

__u32 exp; /* For future use */
__u64 expansion[9]; /* For future use */
};

But the code is really ugly now.

Thanks
Barry


[PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting

2021-02-04 Thread Barry Song
In a real dma mapping use case, after dma_map is done, data will be
transmitted. Thus, in a multi-threaded scenario, IOMMU contention
should not be that severe. For example, if users enable multiple
threads to send network packets through 1G/10G/100Gbps NIC, usually
the steps will be: map -> transmission -> unmap.  Transmission delay
reduces the contention of IOMMU.

Here a delay is added to simulate the transmission between map and unmap
so that the tested result could be more accurate for TX and simple RX.
A typical TX transmission for NIC would be like: map -> TX -> unmap
since the socket buffers come from OS. Simple RX model eg. disk driver,
is also map -> RX -> unmap, but real RX model in a NIC could be more
complicated considering packets can come spontaneously and many drivers
are using pre-mapped buffers pool. This is in the TBD list.

Signed-off-by: Barry Song 
---
 -v2: cleanup according to Robin's feedback. thanks, Robin.

 kernel/dma/map_benchmark.c| 10 ++
 .../testing/selftests/dma/dma_map_benchmark.c | 19 +--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index 1b1b8ff875cb..06636406a245 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -21,6 +21,7 @@
 #define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
 #define DMA_MAP_MAX_THREADS  1024
 #define DMA_MAP_MAX_SECONDS  300
+#define DMA_MAP_MAX_TRANS_DELAY  (10 * NSEC_PER_MSEC) /* 10ms */
 
 #define DMA_MAP_BIDIRECTIONAL  0
 #define DMA_MAP_TO_DEVICE  1
@@ -36,6 +37,7 @@ struct map_benchmark {
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
+   __u32 dma_trans_ns; /* time for DMA transmission in ns */
__u64 expansion[10];/* For future use */
 };
 
@@ -87,6 +89,9 @@ static int map_benchmark_thread(void *data)
map_etime = ktime_get();
map_delta = ktime_sub(map_etime, map_stime);
 
+   /* Pretend DMA is transmitting */
+   ndelay(map->bparam.dma_trans_ns);
+
unmap_stime = ktime_get();
dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);
unmap_etime = ktime_get();
@@ -218,6 +223,11 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
return -EINVAL;
}
 
+   if (map->bparam.dma_trans_ns > DMA_MAP_MAX_TRANS_DELAY) {
+   pr_err("invalid transmission delay\n");
+   return -EINVAL;
+   }
+
if (map->bparam.node != NUMA_NO_NODE &&
!node_possible(map->bparam.node)) {
pr_err("invalid numa node\n");
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
index 7065163a8388..a370290d9503 100644
--- a/tools/testing/selftests/dma/dma_map_benchmark.c
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -11,9 +11,12 @@
 #include 
 #include 
 
+#define NSEC_PER_MSEC  1000000L
+
 #define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
 #define DMA_MAP_MAX_THREADS  1024
 #define DMA_MAP_MAX_SECONDS 300
+#define DMA_MAP_MAX_TRANS_DELAY  (10 * NSEC_PER_MSEC) /* 10ms */
 
 #define DMA_MAP_BIDIRECTIONAL  0
 #define DMA_MAP_TO_DEVICE  1
@@ -35,6 +38,7 @@ struct map_benchmark {
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
+   __u32 dma_trans_ns; /* delay for DMA transmission in ns */
__u64 expansion[10];/* For future use */
 };
 
@@ -45,12 +49,12 @@ int main(int argc, char **argv)
/* default single thread, run 20 seconds on NUMA_NO_NODE */
int threads = 1, seconds = 20, node = -1;
/* default dma mask 32bit, bidirectional DMA */
-   int bits = 32, dir = DMA_MAP_BIDIRECTIONAL;
+   int bits = 32, xdelay = 0, dir = DMA_MAP_BIDIRECTIONAL;
 
int cmd = DMA_MAP_BENCHMARK;
char *p;
 
-   while ((opt = getopt(argc, argv, "t:s:n:b:d:")) != -1) {
+   while ((opt = getopt(argc, argv, "t:s:n:b:d:x:")) != -1) {
switch (opt) {
case 't':
threads = atoi(optarg);
@@ -67,6 +71,9 @@ int main(int argc, char **argv)
case 'd':
dir = atoi(optarg);
break;
+   case 'x':
+   xdelay = atoi(optarg);
+   break;
default:
return -1;
}
@@ -84,6 +91,12 @@ int main(int argc, char **argv)

RE: [PATCH] dma-mapping: benchmark: pretend DMA is transmitting

2021-02-04 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Friday, February 5, 2021 12:51 PM
> To: Song Bao Hua (Barry Song) ;
> m.szyprow...@samsung.com; h...@lst.de; iommu@lists.linux-foundation.org
> Cc: linux-ker...@vger.kernel.org; linux...@openeuler.org
> Subject: Re: [PATCH] dma-mapping: benchmark: pretend DMA is transmitting
> 
> On 2021-02-04 22:58, Barry Song wrote:
> > In a real dma mapping user case, after dma_map is done, data will be
> > transmit. Thus, in multi-threaded user scenario, IOMMU contention
> > should not be that severe. For example, if users enable multiple
> > threads to send network packets through 1G/10G/100Gbps NIC, usually
> > the steps will be: map -> transmission -> unmap.  Transmission delay
> > reduces the contention of IOMMU. Here a delay is added to simulate
> > the transmission for TX case so that the tested result could be
> > more accurate.
> >
> > RX case would be much more tricky. It is not supported yet.
> 
> I guess it might be a reasonable approximation to map several pages,
> then unmap them again after a slightly more random delay. Or maybe
> divide the threads into pairs of mappers and unmappers respectively
> filling up and draining proper little buffer pools.

Yes. Good suggestions. I am actually thinking about how to support
cases like networks. There is a pre-mapped list of pages, and each page
is bound to a hardware DMA block descriptor (BD). So if Linux can
consume the packets in time, those buffers are always re-used. Only
when the page bound to a BD is full and the OS can't consume it in time
is another temp page allocated and mapped; the BD switches to this
temp page, which is finally unmapped once it is no longer needed.
The pre-mapped pages, on the other hand, are never unmapped.
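
As a rough illustration of that RX model, here is a toy userspace
simulation that only counts map/unmap operations. It is one possible
reading of the description above, not driver code: the names rx_ring,
rx_packet() and consume(), the ring size, and the one-temp-page-per-BD
limit are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_BD 4 /* number of buffer descriptors in the toy ring */

struct bd {
	bool premapped_full; /* pre-mapped page holds data not yet consumed */
	bool temp_in_use;    /* an extra temporary page is mapped for this BD */
};

struct rx_ring {
	struct bd bd[NUM_BD];
	unsigned long maps;   /* map operations performed so far */
	unsigned long unmaps; /* unmap operations performed so far */
};

/* Setup maps one page per BD, once; these pages are never unmapped. */
static void ring_init(struct rx_ring *r)
{
	memset(r, 0, sizeof(*r));
	r->maps = NUM_BD;
}

/* A packet arrives on BD i. Returns false if it had to be dropped. */
static bool rx_packet(struct rx_ring *r, int i)
{
	struct bd *d = &r->bd[i];

	if (!d->premapped_full) {	/* fast path: reuse the pre-mapped page */
		d->premapped_full = true;
		return true;
	}
	if (!d->temp_in_use) {		/* slow path: map a temporary page */
		r->maps++;
		d->temp_in_use = true;
		return true;
	}
	return false;			/* no buffer left for this BD */
}

/* The OS consumes everything pending on BD i. */
static void consume(struct rx_ring *r, int i)
{
	struct bd *d = &r->bd[i];

	if (d->temp_in_use) {		/* temporary page is no longer needed */
		r->unmaps++;
		d->temp_in_use = false;
	}
	d->premapped_full = false;
}
```

As long as every rx_packet() is followed by a consume() in time, the map
and unmap counters stay at their initial values; they only move when a
second packet lands on a BD whose page has not been consumed yet.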

For things like filesystems and disk drivers, RX is always requested by
users. The model would be simpler: map -> rx -> unmap. For networks,
RX transmission can come spontaneously.

Anyway, I'll put this on the TBD list. For the moment, the patch mainly
handles the TX path. Or maybe the current code is already able to handle
the simple RX model :-)

> 
> > Signed-off-by: Barry Song 
> > ---
> >   kernel/dma/map_benchmark.c  | 11 +++
> >   tools/testing/selftests/dma/dma_map_benchmark.c | 17 +++--
> >   2 files changed, 26 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> > index 1b1b8ff875cb..1976db7e34e4 100644
> > --- a/kernel/dma/map_benchmark.c
> > +++ b/kernel/dma/map_benchmark.c
> > @@ -21,6 +21,7 @@
> >   #define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
> >   #define DMA_MAP_MAX_THREADS   1024
> >   #define DMA_MAP_MAX_SECONDS   300
> > +#define DMA_MAP_MAX_TRANS_DELAY  (10 * 1000 * 1000) /* 10ms */
> 
> Using MSEC_PER_SEC might be sufficiently self-documenting?

Yes, I guess you mean NSEC_PER_MSEC. will move to it.
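
For reference, the bound under discussion works out as below. This is a
userspace sketch rather than the patch itself: trans_delay_is_valid() is a
hypothetical helper, and NSEC_PER_MSEC is redefined locally with the
kernel's value because userspace has no such macro by default.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as the kernel's NSEC_PER_MSEC; defined locally for userspace. */
#define NSEC_PER_MSEC           1000000L
#define DMA_MAP_MAX_TRANS_DELAY (10 * NSEC_PER_MSEC) /* 10ms */

/* Both spellings of the limit describe the same number of nanoseconds. */
_Static_assert(DMA_MAP_MAX_TRANS_DELAY == 10L * 1000 * 1000,
	       "10ms must equal 10,000,000ns");

/* Hypothetical helper: accept delays in the range 0..10ms inclusive. */
static bool trans_delay_is_valid(long delay_ns)
{
	return delay_ns >= 0 && delay_ns <= DMA_MAP_MAX_TRANS_DELAY;
}
```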

> 
> >   #define DMA_MAP_BIDIRECTIONAL 0
> >   #define DMA_MAP_TO_DEVICE 1
> > @@ -36,6 +37,7 @@ struct map_benchmark {
> > __s32 node; /* which numa node this benchmark will run on */
> > __u32 dma_bits; /* DMA addressing capability */
> > __u32 dma_dir; /* DMA data direction */
> > +   __u32 dma_trans_ns; /* time for DMA transmission in ns */
> > __u64 expansion[10];/* For future use */
> >   };
> >
> > @@ -87,6 +89,10 @@ static int map_benchmark_thread(void *data)
> > map_etime = ktime_get();
> > map_delta = ktime_sub(map_etime, map_stime);
> >
> > +   /* Pretend DMA is transmitting */
> > +   if (map->dir != DMA_FROM_DEVICE)
> > +   ndelay(map->bparam.dma_trans_ns);
> 
> TBH I think the option of a fixed delay between map and unmap might be a
> handy thing in general, so having the direction check at all seems
> needlessly restrictive. As long as the driver implements all the basic
> building blocks, combining them to simulate specific traffic patterns
> can be left up to the benchmark tool.

Sensible, will remove the condition check.
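
The resulting loop shape (map, fixed delay, unmap, with the delay kept
out of both measured latencies) can be sketched in userspace roughly as
follows. This is an illustration under assumptions, not the kernel code:
busy_delay_ns() stands in for ndelay(), map_op/unmap_op are placeholder
callbacks, and every name here is invented for the sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Busy-wait for at least ns nanoseconds (userspace stand-in for ndelay()). */
static void busy_delay_ns(uint64_t ns)
{
	uint64_t start = now_ns();

	while (now_ns() - start < ns)
		;
}

struct sample {
	uint64_t map_ns;   /* time spent in the map step only */
	uint64_t unmap_ns; /* time spent in the unmap step only */
	uint64_t total_ns; /* whole iteration, including the delay */
};

/* One benchmark iteration: map -> pretend transmission -> unmap. */
static struct sample one_iteration(void (*map_op)(void),
				   void (*unmap_op)(void),
				   uint64_t trans_ns)
{
	struct sample s;
	uint64_t t0, t1, t2, t3;

	t0 = now_ns();
	map_op();
	t1 = now_ns();

	busy_delay_ns(trans_ns); /* counted in neither latency */

	t2 = now_ns();
	unmap_op();
	t3 = now_ns();

	s.map_ns = t1 - t0;
	s.unmap_ns = t3 - t2;
	s.total_ns = t3 - t0;
	return s;
}

/* Placeholder operation used when no real work is being measured. */
static void fake_op(void)
{
}
```

Because the two timestamps around the delay are taken separately, the
delay stretches the iteration but does not inflate the reported map or
unmap latency.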

> 
> Robin.
> 
> > +
> > unmap_stime = ktime_get();
> > dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);
> > unmap_etime = ktime_get();
> > @@ -218,6 +224,11 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
> > return -EINVAL;
> > }
> >
> > +   if (map->bparam.dma_trans_ns > DMA_MAP_MAX_TRANS_DELAY) {
> > +   pr_err("invalid transmission delay\n");

[PATCH] dma-mapping: benchmark: pretend DMA is transmitting

2021-02-04 Thread Barry Song
In a real dma mapping use case, after dma_map is done, data will be
transmitted. Thus, in a multi-threaded user scenario, IOMMU contention
should not be that severe. For example, if users enable multiple
threads to send network packets through a 1G/10G/100Gbps NIC, usually
the steps will be: map -> transmission -> unmap.  Transmission delay
reduces the contention of IOMMU. Here a delay is added to simulate
the transmission for the TX case so that the tested result could be
more accurate.

The RX case would be much more tricky. It is not supported yet.

Signed-off-by: Barry Song 
---
 kernel/dma/map_benchmark.c  | 11 +++
 tools/testing/selftests/dma/dma_map_benchmark.c | 17 +++--
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index 1b1b8ff875cb..1976db7e34e4 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -21,6 +21,7 @@
 #define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
 #define DMA_MAP_MAX_THREADS  1024
 #define DMA_MAP_MAX_SECONDS  300
+#define DMA_MAP_MAX_TRANS_DELAY  (10 * 1000 * 1000) /* 10ms */
 
 #define DMA_MAP_BIDIRECTIONAL  0
 #define DMA_MAP_TO_DEVICE  1
@@ -36,6 +37,7 @@ struct map_benchmark {
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
+   __u32 dma_trans_ns; /* time for DMA transmission in ns */
__u64 expansion[10];/* For future use */
 };
 
@@ -87,6 +89,10 @@ static int map_benchmark_thread(void *data)
map_etime = ktime_get();
map_delta = ktime_sub(map_etime, map_stime);
 
+   /* Pretend DMA is transmitting */
+   if (map->dir != DMA_FROM_DEVICE)
+   ndelay(map->bparam.dma_trans_ns);
+
unmap_stime = ktime_get();
dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);
unmap_etime = ktime_get();
@@ -218,6 +224,11 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
return -EINVAL;
}
 
+   if (map->bparam.dma_trans_ns > DMA_MAP_MAX_TRANS_DELAY) {
+   pr_err("invalid transmission delay\n");
+   return -EINVAL;
+   }
+
if (map->bparam.node != NUMA_NO_NODE &&
!node_possible(map->bparam.node)) {
pr_err("invalid numa node\n");
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
index 7065163a8388..dbf426e2fb7f 100644
--- a/tools/testing/selftests/dma/dma_map_benchmark.c
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -14,6 +14,7 @@
 #define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
 #define DMA_MAP_MAX_THREADS  1024
 #define DMA_MAP_MAX_SECONDS 300
+#define DMA_MAP_MAX_TRANS_DELAY  (10 * 1000 * 1000) /* 10ms */
 
 #define DMA_MAP_BIDIRECTIONAL  0
 #define DMA_MAP_TO_DEVICE  1
@@ -35,6 +36,7 @@ struct map_benchmark {
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
+   __u32 dma_trans_ns; /* delay for DMA transmission in ns */
__u64 expansion[10];/* For future use */
 };
 
@@ -45,12 +47,12 @@ int main(int argc, char **argv)
/* default single thread, run 20 seconds on NUMA_NO_NODE */
int threads = 1, seconds = 20, node = -1;
/* default dma mask 32bit, bidirectional DMA */
-   int bits = 32, dir = DMA_MAP_BIDIRECTIONAL;
+   int bits = 32, xdelay = 0, dir = DMA_MAP_BIDIRECTIONAL;
 
int cmd = DMA_MAP_BENCHMARK;
char *p;
 
-   while ((opt = getopt(argc, argv, "t:s:n:b:d:")) != -1) {
+   while ((opt = getopt(argc, argv, "t:s:n:b:d:x:")) != -1) {
switch (opt) {
case 't':
threads = atoi(optarg);
@@ -67,6 +69,9 @@ int main(int argc, char **argv)
case 'd':
dir = atoi(optarg);
break;
+   case 'x':
+   xdelay = atoi(optarg);
+   break;
default:
return -1;
}
@@ -84,6 +89,12 @@ int main(int argc, char **argv)
exit(1);
}
 
+   if (xdelay < 0 || xdelay > DMA_MAP_MAX_TRANS_DELAY) {
+   fprintf(stderr, "invalid transmit delay, must be in 0-%d\n",
+   DMA_MAP_MAX_TRANS_DELAY);
+   exit(1);
+   }
+
/* suppose the mininum DMA zone is 1MB in the world */
if (bits < 20 || bits > 64) {
fprintf(std

RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-02-01 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> Sent: Tuesday, February 2, 2021 3:52 PM
> To: Jason Gunthorpe 
> Cc: Song Bao Hua (Barry Song) ; chensihang (A)
> ; Arnd Bergmann ; Greg
> Kroah-Hartman ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org; Zhangfei Gao
> ; Liguozhu (Kenneth) ;
> linux-accelerat...@lists.ozlabs.org
> Subject: RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> > From: Jason Gunthorpe 
> > Sent: Tuesday, February 2, 2021 7:44 AM
> >
> > On Fri, Jan 29, 2021 at 10:09:03AM +0000, Tian, Kevin wrote:
> > > > SVA is not doom to work with IO page fault only. If we have SVA+pin,
> > > > we would get both sharing address and stable I/O latency.
> > >
> > > Isn't it like a traditional MAP_DMA API (imply pinning) plus specifying
> > > cpu_va of the memory pool as the iova?
> >
> > I think their issue is the HW can't do the cpu_va trick without also
> > involving the system IOMMU in a SVA mode
> >
> 
> This is the part that I didn't understand. Using cpu_va in a MAP_DMA
> interface doesn't require device support. It's just an user-specified
> address to be mapped into the IOMMU page table. On the other hand,

The background is that uacce is based on SVA and we are building
applications on uacce:
https://www.kernel.org/doc/html/v5.10/misc-devices/uacce.html
so the IOMMU simply uses the page table of the MMU, and doesn't do any
special mapping to a user-specified address. We don't want to break
the basic assumption that uacce is using SVA; otherwise, we would
need to rebuild uacce and everything built on it.

> sharing CPU page table through a SVA interface for an usage where I/O
> page faults must be completely avoided seems a misleading attempt.

This is not for completely avoiding IO page faults; it is just
an extension for the high-performance I/O case, providing a way to
avoid IO latency jitter. Whether to use it is totally up to users.

> Even if people do want this model (e.g. mix pinning+fault), it should be
> a mm syscall as Greg pointed out, not specific to sva.
> 

We are glad to make it a syscall if people are happy with
it. The simplest way would be a syscall similar to
userfaultfd if we don't want to mess up mm_struct.

> Thanks
> Kevin

Thanks
Barry


RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-02-01 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, February 2, 2021 12:44 PM
> To: Tian, Kevin 
> Cc: Song Bao Hua (Barry Song) ; chensihang (A)
> ; Arnd Bergmann ; Greg
> Kroah-Hartman ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org; Zhangfei Gao
> ; Liguozhu (Kenneth) ;
> linux-accelerat...@lists.ozlabs.org
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Fri, Jan 29, 2021 at 10:09:03AM +0000, Tian, Kevin wrote:
> > > SVA is not doom to work with IO page fault only. If we have SVA+pin,
> > > we would get both sharing address and stable I/O latency.
> >
> > Isn't it like a traditional MAP_DMA API (imply pinning) plus specifying
> > cpu_va of the memory pool as the iova?
> 
> I think their issue is the HW can't do the cpu_va trick without also
> involving the system IOMMU in a SVA mode
> 
> It really is something that belongs under some general /dev/sva as we
> talked on the vfio thread

AFAIK, there is no such /dev/sva yet, so /dev/uacce is a uAPI
which belongs to SVA.

Another option is that we add a system call like the one in
fs/userfaultfd.c, and move the file_operations and ioctl
to an anon inode by creating the fd via anon_inode_getfd().
Then nothing will be buried in uacce.

> 
> Jason

Thanks
Barry



RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-29 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> Sent: Friday, January 29, 2021 11:09 PM
> To: Song Bao Hua (Barry Song) ; Jason Gunthorpe
> 
> Cc: chensihang (A) ; Arnd Bergmann
> ; Greg Kroah-Hartman ;
> linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> linux...@kvack.org; Zhangfei Gao ; Liguozhu
> (Kenneth) ; linux-accelerat...@lists.ozlabs.org
> Subject: RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> > From: Song Bao Hua (Barry Song)
> > Sent: Tuesday, January 26, 2021 9:27 AM
> >
> > > -Original Message-
> > > From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> > > Sent: Tuesday, January 26, 2021 2:13 PM
> > > To: Song Bao Hua (Barry Song) 
> > > Cc: Wangzhou (B) ; Greg Kroah-Hartman
> > > ; Arnd Bergmann ;
> > Zhangfei Gao
> > > ; linux-accelerat...@lists.ozlabs.org;
> > > linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> > > linux...@kvack.org; Liguozhu (Kenneth) ;
> > chensihang
> > > (A) 
> > > Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> > >
> > > On Mon, Jan 25, 2021 at 11:35:22PM +0000, Song Bao Hua (Barry Song)
> > wrote:
> > >
> > > > > On Mon, Jan 25, 2021 at 10:21:14PM +0000, Song Bao Hua (Barry Song)
> > wrote:
> > > > > > mlock, while certainly be able to prevent swapping out, it won't
> > > > > > be able to stop page moving due to:
> > > > > > * memory compaction in alloc_pages()
> > > > > > * making huge pages
> > > > > > * numa balance
> > > > > > * memory compaction in CMA
> > > > >
> > > > > Enabling those things is a major reason to have SVA device in the
> > > > > first place, providing a SW API to turn it all off seems like the
> > > > > wrong direction.
> > > >
> > > > I wouldn't say this is a major reason to have SVA. If we read the
> > > > history of SVA and papers, people would think easy programming due
> > > > to data struct sharing between cpu and device, and process space
> > > > isolation in device would be the major reasons for SVA. SVA also
> > > > declares it supports zero-copy while zero-copy doesn't necessarily
> > > > depend on SVA.
> > >
> > > Once you have to explicitly make system calls to declare memory under
> > > IO, you loose all of that.
> > >
> > > Since you've asked the app to be explicit about the DMAs it intends to
> > > do, there is not really much reason to use SVA for those DMAs anymore.
> >
> > Let's see a non-SVA case. We are not using SVA, we can have
> > a memory pool by hugetlb or pin, and app can allocate memory
> > from this pool, and get stable I/O performance on the memory
> > from the pool. But device has its separate page table which
> > is not bound with this process, thus lacking the protection
> > of process space isolation. Plus, CPU and device are using
> > different address.
> >
> > And then we move to SVA case, we can still have a memory pool
> > by hugetlb or pin, and app can allocate memory from this pool
> > since this pool is mapped to the address space of the process,
> > and we are able to get stable I/O performance since it is always
> > there. But in this case, device is using the page table of
> > process with the full permission control.
> > And they are using same address and can possibly enjoy the easy
> > programming if HW supports.
> >
> > SVA is not doom to work with IO page fault only. If we have SVA+pin,
> > we would get both sharing address and stable I/O latency.
> >
> 
> Isn't it like a traditional MAP_DMA API (imply pinning) plus specifying
> cpu_va of the memory pool as the iova?

I think it enjoys the stable I/O latency of traditional
MAP_DMA, and also uses the process page table which SVA
provides. The major difference is that in the SVA case, the
iova belongs entirely to the process and is as ordinary as
any other heap/stack/data address:
p = mmap(.MAP_ANON);
ioctl(/dev/acc, p, PIN);

SVA by itself provides the ability to guarantee address
space isolation between multiple processes. If the device
can access data structures such as lists and trees
directly, users can further enjoy the programming
convenience SVA gives.

So we are looking for a combination of the stable I/O latency
of a traditional DMA map and the abilities of SVA.
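
Spelled out a little more, the two-line pseudocode above might look like
the sketch below. Everything device-specific in it is hypothetical: the
acc_pin_address layout, the ACC_CMD_PIN request number and the device path
are placeholders for illustration and do not describe an existing uAPI.
The descriptor is returned to the caller on the assumption that a driver
would tie the lifetime of the pin to the open file.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* HYPOTHETICAL argument layout for a pin request. */
struct acc_pin_address {
	uint64_t addr; /* start of the region, a normal process VA */
	uint64_t size; /* length of the region in bytes */
};

/* HYPOTHETICAL ioctl request number, for illustration only. */
#define ACC_CMD_PIN _IOW('W', 2, struct acc_pin_address)

/* Ordinary anonymous memory: nothing device-specific about the address. */
static void *alloc_anon(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Ask the (hypothetical) device to pin [p, p + len).
 * Returns an open descriptor on success, -1 on any failure.
 */
static int pin_region(const char *dev_path, void *p, size_t len)
{
	struct acc_pin_address arg = {
		.addr = (uint64_t)(uintptr_t)p,
		.size = len,
	};
	int fd = open(dev_path, O_RDWR);

	if (fd < 0)
		return -1;
	if (ioctl(fd, ACC_CMD_PIN, &arg) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The same pointer p is what the CPU dereferences and what would be handed
to the device, which is the address sharing being argued for here.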

> 
> Thanks
> Kevin

Thanks
Barry



RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-27 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Wednesday, January 27, 2021 7:20 AM
> To: Song Bao Hua (Barry Song) 
> Cc: Wangzhou (B) ; Greg Kroah-Hartman
> ; Arnd Bergmann ; Zhangfei Gao
> ; linux-accelerat...@lists.ozlabs.org;
> linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> linux...@kvack.org; Liguozhu (Kenneth) ; chensihang
> (A) 
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Tue, Jan 26, 2021 at 01:26:45AM +0000, Song Bao Hua (Barry Song) wrote:
> > > On Mon, Jan 25, 2021 at 11:35:22PM +0000, Song Bao Hua (Barry Song) wrote:
> > >
> > > > > On Mon, Jan 25, 2021 at 10:21:14PM +0000, Song Bao Hua (Barry Song)
> wrote:
> > > > > > mlock, while certainly be able to prevent swapping out, it won't
> > > > > > be able to stop page moving due to:
> > > > > > * memory compaction in alloc_pages()
> > > > > > * making huge pages
> > > > > > * numa balance
> > > > > > * memory compaction in CMA
> > > > >
> > > > > Enabling those things is a major reason to have SVA device in the
> > > > > first place, providing a SW API to turn it all off seems like the
> > > > > wrong direction.
> > > >
> > > > I wouldn't say this is a major reason to have SVA. If we read the
> > > > history of SVA and papers, people would think easy programming due
> > > > to data struct sharing between cpu and device, and process space
> > > > isolation in device would be the major reasons for SVA. SVA also
> > > > declares it supports zero-copy while zero-copy doesn't necessarily
> > > > depend on SVA.
> > >
> > > Once you have to explicitly make system calls to declare memory under
> > > IO, you loose all of that.
> > >
> > > Since you've asked the app to be explicit about the DMAs it intends to
> > > do, there is not really much reason to use SVA for those DMAs anymore.
> >
> > Let's see a non-SVA case. We are not using SVA, we can have
> > a memory pool by hugetlb or pin, and app can allocate memory
> > from this pool, and get stable I/O performance on the memory
> > from the pool. But device has its separate page table which
> > is not bound with this process, thus lacking the protection
> > of process space isolation. Plus, CPU and device are using
> > different address.
> 
> So you are relying on the platform to do the SVA for the device?
> 

Sorry for the late response.

uacce and its userspace framework UADK depend on SVA, leveraging
the enhanced security of isolated process address spaces.

This patch is mainly an extension for performance optimization: it
provides stable high-performance I/O on pinned memory, even though the
hardware supports IO page faults to get pages back after swap-out
or page migration. IO page faults cause serious latency jitter for
high-speed I/O. Slow devices don't need to use this extension.

> This feels like it goes back to another topic where I felt the SVA
> setup uAPI should be shared and not buried into every driver's unique
> ioctls.
> 
> Having something like this in a shared SVA system is somewhat less
> strange.

Sounds reasonable. On the other hand, uacce seems to be a common
uAPI for SVA, and probably the only one at the moment.

uacce is a framework, not a specific driver: any accelerator
can hook into it as long as the device provides uacce_ops and
registers itself via uacce_register(). uacce itself isn't bound
to any specific hardware. So the uacce interfaces are kind of a
common uAPI :-)

> 
> Jason

Thanks
Barry



RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-25 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, January 26, 2021 2:13 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Wangzhou (B) ; Greg Kroah-Hartman
> ; Arnd Bergmann ; Zhangfei Gao
> ; linux-accelerat...@lists.ozlabs.org;
> linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> linux...@kvack.org; Liguozhu (Kenneth) ; chensihang
> (A) 
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Mon, Jan 25, 2021 at 11:35:22PM +0000, Song Bao Hua (Barry Song) wrote:
> 
> > > On Mon, Jan 25, 2021 at 10:21:14PM +0000, Song Bao Hua (Barry Song) wrote:
> > > > mlock, while certainly be able to prevent swapping out, it won't
> > > > be able to stop page moving due to:
> > > > * memory compaction in alloc_pages()
> > > > * making huge pages
> > > > * numa balance
> > > > * memory compaction in CMA
> > >
> > > Enabling those things is a major reason to have SVA device in the
> > > first place, providing a SW API to turn it all off seems like the
> > > wrong direction.
> >
> > I wouldn't say this is a major reason to have SVA. If we read the
> > history of SVA and papers, people would think easy programming due
> > to data struct sharing between cpu and device, and process space
> > isolation in device would be the major reasons for SVA. SVA also
> > declares it supports zero-copy while zero-copy doesn't necessarily
> > depend on SVA.
> 
> Once you have to explicitly make system calls to declare memory under
> IO, you loose all of that.
> 
> Since you've asked the app to be explicit about the DMAs it intends to
> do, there is not really much reason to use SVA for those DMAs anymore.

Let's look at a non-SVA case first. Without SVA, we can have
a memory pool backed by hugetlb or pinning; the app can allocate
memory from this pool and get stable I/O performance on it. But
the device has its own separate page table which is not bound to
this process, so it lacks the protection of process address space
isolation. Plus, the CPU and the device use different addresses.

Now move to the SVA case. We can still have a memory pool backed
by hugetlb or pinning, and the app can allocate memory from this
pool since the pool is mapped into the address space of the process.
We still get stable I/O performance since the memory is always
resident. But in this case, the device uses the page table of the
process, with full permission control. The CPU and the device also
use the same addresses, and can possibly enjoy the easy programming
model if the HW supports it.

SVA is not doomed to work with IO page faults only. If we have SVA+pin,
we get both shared addresses and stable I/O latency.

> 
> Jason

Thanks
Barry



RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-25 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of
> Jason Gunthorpe
> Sent: Tuesday, January 26, 2021 12:16 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Wangzhou (B) ; Greg Kroah-Hartman
> ; Arnd Bergmann ; Zhangfei Gao
> ; linux-accelerat...@lists.ozlabs.org;
> linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> linux...@kvack.org; Liguozhu (Kenneth) ; chensihang
> (A) 
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Mon, Jan 25, 2021 at 10:21:14PM +0000, Song Bao Hua (Barry Song) wrote:
> > mlock, while certainly be able to prevent swapping out, it won't
> > be able to stop page moving due to:
> > * memory compaction in alloc_pages()
> > * making huge pages
> > * numa balance
> > * memory compaction in CMA
> 
> Enabling those things is a major reason to have SVA device in the
> first place, providing a SW API to turn it all off seems like the
> wrong direction.

I wouldn't say this is a major reason to have SVA. If we read the
history of SVA and the papers on it, the major reasons given for SVA
are easier programming, thanks to data structure sharing between CPU
and device, and process address space isolation on the device. SVA is
also said to support zero-copy, though zero-copy doesn't necessarily
depend on SVA.

Page migration and I/O page fault overhead, on the other hand, are
probably the major problems that block SVA from becoming a
high-performance and more popular solution.

> 
> If the device doesn't want to use SVA then don't use it, use normal
> DMA pinning like everything else.
> 

If we disable SVA, we lose the benefits of SVA: address sharing
and process address space isolation.

> Jason

Thanks
Barry


RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-25 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, January 26, 2021 4:47 AM
> To: Wangzhou (B) 
> Cc: Greg Kroah-Hartman ; Arnd Bergmann
> ; Zhangfei Gao ;
> linux-accelerat...@lists.ozlabs.org; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org; Song Bao Hua (Barry 
> Song)
> ; Liguozhu (Kenneth) ;
> chensihang (A) 
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Mon, Jan 25, 2021 at 04:34:56PM +0800, Zhou Wang wrote:
> 
> > +static int uacce_pin_page(struct uacce_pin_container *priv,
> > + struct uacce_pin_address *addr)
> > +{
> > +   unsigned int flags = FOLL_FORCE | FOLL_WRITE;
> > +   unsigned long first, last, nr_pages;
> > +   struct page **pages;
> > +   struct pin_pages *p;
> > +   int ret;
> > +
> > +   first = (addr->addr & PAGE_MASK) >> PAGE_SHIFT;
> > +   last = ((addr->addr + addr->size - 1) & PAGE_MASK) >> PAGE_SHIFT;
> > +   nr_pages = last - first + 1;
> > +
> > +   pages = vmalloc(nr_pages * sizeof(struct page *));
> > +   if (!pages)
> > +   return -ENOMEM;
> > +
> > +   p = kzalloc(sizeof(*p), GFP_KERNEL);
> > +   if (!p) {
> > +   ret = -ENOMEM;
> > +   goto free;
> > +   }
> > +
> > +   ret = pin_user_pages_fast(addr->addr & PAGE_MASK, nr_pages,
> > + flags | FOLL_LONGTERM, pages);
> 
> This needs to copy the RLIMIT_MEMLOCK and can_do_mlock() stuff from
> other places, like ib_umem_get
> 
> > +   ret = xa_err(xa_store(&priv->array, p->first, p, GFP_KERNEL));
> 
> And this is really weird, I don't think it makes sense to make handles
> for DMA based on the starting VA.
> 
> > +static int uacce_unpin_page(struct uacce_pin_container *priv,
> > +   struct uacce_pin_address *addr)
> > +{
> > +   unsigned long first, last, nr_pages;
> > +   struct pin_pages *p;
> > +
> > +   first = (addr->addr & PAGE_MASK) >> PAGE_SHIFT;
> > +   last = ((addr->addr + addr->size - 1) & PAGE_MASK) >> PAGE_SHIFT;
> > +   nr_pages = last - first + 1;
> > +
> > +   /* find pin_pages */
> > +   p = xa_load(&priv->array, first);
> > +   if (!p)
> > +   return -ENODEV;
> > +
> > +   if (p->nr_pages != nr_pages)
> > +   return -EINVAL;
> > +
> > +   /* unpin */
> > +   unpin_user_pages(p->pages, p->nr_pages);
> 
> And unpinning without guaranteeing there is no ongoing DMA is really
> weird

In the SVA case, the kernel has no idea whether accelerators are accessing
the memory, so I would assume SVA has a method to prevent pages that
are being transferred from being migrated or released. Otherwise,
SVA would crash easily in a system under high memory pressure.

Anyway, this is a problem worth investigating further.

> 
> Are you abusing this in conjunction with a SVA scheme just to prevent
> page motion? Why wasn't mlock good enough?

Page migration won't cause any dysfunction in the SVA case, as an IO page
fault will get a valid page again. It is only a performance issue:
an IO page fault has larger latency than a normal page fault, and
would be 3-80x slower than a page fault[1].

mlock can certainly prevent swapping out, but it won't
be able to stop pages from moving due to:
* memory compaction in alloc_pages()
* making huge pages
* numa balance
* memory compaction in CMA
etc.

[1] https://ieeexplore.ieee.org/stamp/stamp.jsp?tp==7482091=1
> 
> Jason

Thanks
Barry


[PATCH RESEND] dma-mapping: benchmark: fix kernel crash when dma_map_single fails

2021-01-24 Thread Barry Song
if dma_map_single() fails, kernel will give the below oops since
task_struct has been destroyed and we are running into the memory
corruption due to use-after-free in kthread_stop():

[   48.095310] Unable to handle kernel paging request at virtual address 00c473548040
[   48.095736] Mem abort info:
[   48.095864]   ESR = 0x9604
[   48.096025]   EC = 0x25: DABT (current EL), IL = 32 bits
[   48.096268]   SET = 0, FnV = 0
[   48.096401]   EA = 0, S1PTW = 0
[   48.096538] Data abort info:
[   48.096659]   ISV = 0, ISS = 0x0004
[   48.096820]   CM = 0, WnR = 0
[   48.097079] user pgtable: 4k pages, 48-bit VAs, pgdp=000104639000
[   48.098099] [00c473548040] pgd=, p4d=
[   48.098832] Internal error: Oops: 9604 [#1] PREEMPT SMP
[   48.099232] Modules linked in:
[   48.099387] CPU: 0 PID: 2 Comm: kthreadd Tainted: GW
[   48.099887] Hardware name: linux,dummy-virt (DT)
[   48.100078] pstate: 6005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[   48.100516] pc : __kmalloc_node+0x214/0x368
[   48.100944] lr : __kmalloc_node+0x1f4/0x368
[   48.101458] sp : 800011f0bb80
[   48.101843] x29: 800011f0bb80 x28: c0098ec0
[   48.102330] x27:  x26: 001d4600
[   48.102648] x25: c0098ec0 x24: 800011b6a000
[   48.102988] x23:  x22: c0098ec0
[   48.10] x21: 8000101d7a54 x20: 0dc0
[   48.103657] x19: c0001e00 x18: 
[   48.104069] x17:  x16: 
[   48.105449] x15: 01aa0304e7b9 x14: 03b1
[   48.106401] x13: 8000122d5000 x12: 80001228d000
[   48.107296] x11: c0154340 x10: 
[   48.107862] x9 : 8fff x8 : c473527f
[   48.108326] x7 : 800011e62f58 x6 : c01c8ed8
[   48.108778] x5 : c0098ec0 x4 : 
[   48.109223] x3 : 001d4600 x2 : 0040
[   48.109656] x1 : 0001 x0 : ffc473548000
[   48.110104] Call trace:
[   48.110287]  __kmalloc_node+0x214/0x368
[   48.110493]  __vmalloc_node_range+0xc4/0x298
[   48.110805]  copy_process+0x2c8/0x15c8
[   48.33]  kernel_clone+0x5c/0x3c0
[   48.111373]  kernel_thread+0x64/0x90
[   48.111604]  kthreadd+0x158/0x368
[   48.111810]  ret_from_fork+0x10/0x30
[   48.112336] Code: 17e9 b9402a62 b94008a1 11000421 (f8626802)
[   48.112884] ---[ end trace d4890e21e75419d5 ]---

Signed-off-by: Barry Song 
---
 kernel/dma/map_benchmark.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index b1496e744c68..1b1b8ff875cb 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -147,8 +147,10 @@ static int do_map_benchmark(struct map_benchmark_data *map)
atomic64_set(&map->sum_sq_unmap, 0);
atomic64_set(&map->loops, 0);
 
-   for (i = 0; i < threads; i++)
+   for (i = 0; i < threads; i++) {
+   get_task_struct(tsk[i]);
wake_up_process(tsk[i]);
+   }
 
msleep_interruptible(map->bparam.seconds * 1000);
 
@@ -183,6 +185,8 @@ static int do_map_benchmark(struct map_benchmark_data *map)
}
 
 out:
+   for (i = 0; i < threads; i++)
+   put_task_struct(tsk[i]);
put_device(map->dev);
kfree(tsk);
return ret;
-- 
2.25.1
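
The reference-count reasoning behind this fix can be pictured with a toy
userspace model. This is only an analogy under stated assumptions:
toy_task, toy_create(), toy_get(), toy_put() and toy_thread_exit() are
hypothetical names standing in for task_struct, thread creation,
get_task_struct(), put_task_struct() and a benchmark thread returning
early after a failed mapping. None of it is kernel API.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a task_struct with a reference count. */
struct toy_task {
	int refs;	/* number of holders */
	bool exited;	/* set when the thread function returns early */
};

/* Create the object with one reference owned by the "thread" itself. */
static struct toy_task *toy_create(void)
{
	struct toy_task *t = calloc(1, sizeof(*t));

	if (t)
		t->refs = 1;
	return t;
}

/* Analogue of get_task_struct(): the controller takes its own reference. */
static void toy_get(struct toy_task *t)
{
	t->refs++;
}

/* Analogue of put_task_struct(): returns true if the object was freed. */
static bool toy_put(struct toy_task *t)
{
	if (--t->refs > 0)
		return false;
	free(t);
	return true;
}

/* The thread exits on its own and drops the reference it owned. */
static bool toy_thread_exit(struct toy_task *t)
{
	t->exited = true;
	return toy_put(t);
}
```

With the extra toy_get() taken before the thread is woken, the object
is still valid when the controller later inspects it, even if the
thread has already exited; the final toy_put() in the cleanup path is
what frees it.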



[PATCH] dma-mapping: benchmark: fix kernel crash when dma_map_single fails

2021-01-07 Thread Barry Song
if dma_map_single() fails, kernel will give the below oops since
task_struct has been destroyed and we are running into the memory
corruption due to use-after-free in kthread_stop():

[   48.095310] Unable to handle kernel paging request at virtual address 00c473548040
[   48.095736] Mem abort info:
[   48.095864]   ESR = 0x9604
[   48.096025]   EC = 0x25: DABT (current EL), IL = 32 bits
[   48.096268]   SET = 0, FnV = 0
[   48.096401]   EA = 0, S1PTW = 0
[   48.096538] Data abort info:
[   48.096659]   ISV = 0, ISS = 0x0004
[   48.096820]   CM = 0, WnR = 0
[   48.097079] user pgtable: 4k pages, 48-bit VAs, pgdp=000104639000
[   48.098099] [00c473548040] pgd=, p4d=
[   48.098832] Internal error: Oops: 9604 [#1] PREEMPT SMP
[   48.099232] Modules linked in:
[   48.099387] CPU: 0 PID: 2 Comm: kthreadd Tainted: GW
[   48.099887] Hardware name: linux,dummy-virt (DT)
[   48.100078] pstate: 6005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[   48.100516] pc : __kmalloc_node+0x214/0x368
[   48.100944] lr : __kmalloc_node+0x1f4/0x368
[   48.101458] sp : 800011f0bb80
[   48.101843] x29: 800011f0bb80 x28: c0098ec0
[   48.102330] x27:  x26: 001d4600
[   48.102648] x25: c0098ec0 x24: 800011b6a000
[   48.102988] x23:  x22: c0098ec0
[   48.10] x21: 8000101d7a54 x20: 0dc0
[   48.103657] x19: c0001e00 x18: 
[   48.104069] x17:  x16: 
[   48.105449] x15: 01aa0304e7b9 x14: 03b1
[   48.106401] x13: 8000122d5000 x12: 80001228d000
[   48.107296] x11: c0154340 x10: 
[   48.107862] x9 : 8fff x8 : c473527f
[   48.108326] x7 : 800011e62f58 x6 : c01c8ed8
[   48.108778] x5 : c0098ec0 x4 : 
[   48.109223] x3 : 001d4600 x2 : 0040
[   48.109656] x1 : 0001 x0 : ffc473548000
[   48.110104] Call trace:
[   48.110287]  __kmalloc_node+0x214/0x368
[   48.110493]  __vmalloc_node_range+0xc4/0x298
[   48.110805]  copy_process+0x2c8/0x15c8
[   48.33]  kernel_clone+0x5c/0x3c0
[   48.111373]  kernel_thread+0x64/0x90
[   48.111604]  kthreadd+0x158/0x368
[   48.111810]  ret_from_fork+0x10/0x30
[   48.112336] Code: 17e9 b9402a62 b94008a1 11000421 (f8626802)
[   48.112884] ---[ end trace d4890e21e75419d5 ]---

Signed-off-by: Barry Song 
---
 kernel/dma/map_benchmark.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index b1496e744c68..1b1b8ff875cb 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -147,8 +147,10 @@ static int do_map_benchmark(struct map_benchmark_data *map)
atomic64_set(&map->sum_sq_unmap, 0);
atomic64_set(&map->loops, 0);
 
-   for (i = 0; i < threads; i++)
+   for (i = 0; i < threads; i++) {
+   get_task_struct(tsk[i]);
wake_up_process(tsk[i]);
+   }
 
msleep_interruptible(map->bparam.seconds * 1000);
 
@@ -183,6 +185,8 @@ static int do_map_benchmark(struct map_benchmark_data *map)
}
 
 out:
+   for (i = 0; i < threads; i++)
+   put_task_struct(tsk[i]);
put_device(map->dev);
kfree(tsk);
return ret;
-- 
2.25.1



RE: [PATCH] dma-mapping: benchmark: check the validity of dma mask bits

2020-12-18 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Saturday, December 19, 2020 7:10 AM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com
> Cc: iommu@lists.linux-foundation.org; Linuxarm ; Dan
> Carpenter 
> Subject: Re: [PATCH] dma-mapping: benchmark: check the validity of dma mask
> bits
> 
> On 2020-12-12 10:18, Barry Song wrote:
> > If dma_mask_bits is larger than 64, the behaviour is undefined. On the
> > other hand, dma_mask_bits which is smaller than 20 (1MB) makes no sense
> > in real hardware.
> >
> > Reported-by: Dan Carpenter 
> > Signed-off-by: Barry Song 
> > ---
> >   kernel/dma/map_benchmark.c | 6 ++
> >   1 file changed, 6 insertions(+)
> >
> > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> > index b1496e744c68..19f661692073 100644
> > --- a/kernel/dma/map_benchmark.c
> > +++ b/kernel/dma/map_benchmark.c
> > @@ -214,6 +214,12 @@ static long map_benchmark_ioctl(struct file *file,
> unsigned int cmd,
> > return -EINVAL;
> > }
> >
> > +   if (map->bparam.dma_bits < 20 ||
> 
> FWIW I don't think we need to bother with a lower limit here - it's
> unsigned, and a pointlessly small value will fail gracefully when we
> come to actually set the mask anyway. We only need to protect kernel
> code from going wrong, not userspace from being stupid to its own detriment.

I am not sure the kernel can reject a small dma mask if drivers
don't handle it properly.
A month ago, when I was debugging the dma map benchmark, I set a value
less than 32 for devices behind arm-smmu-v3, and setting the mask always
succeeded, but dma_map_single() always failed.
I didn't debug the issue at that time, and I am not sure about the
latest status of the iommu driver.

drivers/iommu/intel/iommu.c used to have a dma_supported() to reject
small dma_mask:
static const struct dma_map_ops bounce_dma_ops = {
...
.dma_supported  = dma_direct_supported,
};


> 
> Robin.
> 
> > +   map->bparam.dma_bits > 64) {
> > +   pr_err("invalid dma_bits\n");
> > +   return -EINVAL;
> > +   }
> > +
> > if (map->bparam.node != NUMA_NO_NODE &&
> > !node_possible(map->bparam.node)) {
> > pr_err("invalid numa node\n");
> >

Thanks
Barry



RE: [PATCH v2] dma-mapping: add unlikely hint for error path in dma_mapping_error

2020-12-13 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Heiner Kallweit [mailto:hkallwe...@gmail.com]
> Sent: Monday, December 14, 2020 5:33 AM
> To: Christoph Hellwig ; Marek Szyprowski
> ; Robin Murphy ; Song Bao Hua
> (Barry Song) 
> Cc: open list:AMD IOMMU (AMD-VI) ; Linux
> Kernel Mailing List 
> Subject: [PATCH v2] dma-mapping: add unlikely hint for error path in
> dma_mapping_error
> 
> Zillions of drivers use the unlikely() hint when checking the result of
> dma_mapping_error(). This is an inline function anyway, so we can move
> the hint into this function and remove it from drivers.
> 
> Signed-off-by: Heiner Kallweit 

Not sure if this is really necessary. The original code seems
more readable: readers can more easily see that we are
predicting the branch based on the return value of
dma_mapping_error().

Anyway, I don't object to this one. If other people like it, I am
also OK with it.
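
For readers comparing the two styles, here is a minimal userspace
sketch of the idea in the patch, with the branch hint living inside
the inline helper instead of at every call site. The names
toy_unlikely(), TOY_MAPPING_ERROR and toy_mapping_error() are
hypothetical stand-ins for unlikely(), DMA_MAPPING_ERROR and
dma_mapping_error(); the sketch assumes a GCC- or Clang-compatible
compiler for __builtin_expect().

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's unlikely() macro. */
#define toy_unlikely(x)		__builtin_expect(!!(x), 0)

/* Hypothetical stand-in for DMA_MAPPING_ERROR (all bits set). */
#define TOY_MAPPING_ERROR	(~(uint64_t)0)

/* The hint is applied once here, so callers can use a plain if (). */
static inline int toy_mapping_error(uint64_t dma_addr)
{
	if (toy_unlikely(dma_addr == TOY_MAPPING_ERROR))
		return -ENOMEM;
	return 0;
}
```

A caller then writes `if (toy_mapping_error(addr))` with no wrapper.
The hint only affects code layout, not the result, which is why the
readability argument above is about where the prediction is visible
rather than about behaviour.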

> ---
> v2:
> Split the big patch into the change for dma-mapping.h and follow-up
> patches per subsystem that will go through the trees of the respective
> maintainers.
> ---
>  include/linux/dma-mapping.h | 2 +-
>  kernel/dma/map_benchmark.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 2e49996a8..6177e20b5 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -95,7 +95,7 @@ static inline int dma_mapping_error(struct device *dev,
> dma_addr_t dma_addr)
>  {
>   debug_dma_mapping_error(dev, dma_addr);
> 
> - if (dma_addr == DMA_MAPPING_ERROR)
> + if (unlikely(dma_addr == DMA_MAPPING_ERROR))
>   return -ENOMEM;
>   return 0;
>  }
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index b1496e744..901420a5d 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -78,7 +78,7 @@ static int map_benchmark_thread(void *data)
> 
>   map_stime = ktime_get();
>   dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, map->dir);
> - if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
> + if (dma_mapping_error(map->dev, dma_addr)) {
>   pr_err("dma_map_single failed on %s\n",
>   dev_name(map->dev));
>   ret = -ENOMEM;
> --
> 2.29.2

Thanks
Barry



[PATCH] dma-mapping: benchmark: check the validity of dma mask bits

2020-12-12 Thread Barry Song
If dma_mask_bits is larger than 64, the behaviour is undefined. On the
other hand, dma_mask_bits which is smaller than 20 (1MB) makes no sense
in real hardware.

Reported-by: Dan Carpenter 
Signed-off-by: Barry Song 
---
 kernel/dma/map_benchmark.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index b1496e744c68..19f661692073 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -214,6 +214,12 @@ static long map_benchmark_ioctl(struct file *file, 
unsigned int cmd,
return -EINVAL;
}
 
+   if (map->bparam.dma_bits < 20 ||
+   map->bparam.dma_bits > 64) {
+   pr_err("invalid dma_bits\n");
+   return -EINVAL;
+   }
+
if (map->bparam.node != NUMA_NO_NODE &&
!node_possible(map->bparam.node)) {
pr_err("invalid numa node\n");
-- 
2.25.1



RE: [bug report] dma-mapping: add benchmark support for streaming DMA APIs

2020-12-09 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> Sent: Wednesday, December 9, 2020 8:00 PM
> To: Song Bao Hua (Barry Song) 
> Cc: iommu@lists.linux-foundation.org
> Subject: [bug report] dma-mapping: add benchmark support for streaming DMA 
> APIs
> 
> Hello Barry Song,
> 
> The patch 65789daa8087: "dma-mapping: add benchmark support for
> streaming DMA APIs" from Nov 16, 2020, leads to the following static
> checker warning:
> 
>   kernel/dma/map_benchmark.c:241 map_benchmark_ioctl()
>   error: undefined (user controlled) shift '1 << (map->bparam.dma_bits)'
> 
> kernel/dma/map_benchmark.c
>191  static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
>192  unsigned long arg)
>193  {
>194  struct map_benchmark_data *map = file->private_data;
>195  void __user *argp = (void __user *)arg;
>196  u64 old_dma_mask;
>197
>198  int ret;
>199
>200  if (copy_from_user(&map->bparam, argp, sizeof(map->bparam)))
>^
> Comes from the user
> 
>201  return -EFAULT;
>202
>203  switch (cmd) {
>204  case DMA_MAP_BENCHMARK:
>205  if (map->bparam.threads == 0 ||
>206  map->bparam.threads > DMA_MAP_MAX_THREADS) {
>207  pr_err("invalid thread number\n");
>208  return -EINVAL;
>209  }
>210
>211  if (map->bparam.seconds == 0 ||
>212  map->bparam.seconds > DMA_MAP_MAX_SECONDS) {
>213  pr_err("invalid duration seconds\n");
>214  return -EINVAL;
>215  }
>216
>217  if (map->bparam.node != NUMA_NO_NODE &&
>218  !node_possible(map->bparam.node)) {
>219  pr_err("invalid numa node\n");
>220  return -EINVAL;
>221  }
>222
>223  switch (map->bparam.dma_dir) {
>224  case DMA_MAP_BIDIRECTIONAL:
>225  map->dir = DMA_BIDIRECTIONAL;
>226  break;
>227  case DMA_MAP_FROM_DEVICE:
>228  map->dir = DMA_FROM_DEVICE;
>229  break;
>230  case DMA_MAP_TO_DEVICE:
>231  map->dir = DMA_TO_DEVICE;
>232  break;
>233  default:
>234  pr_err("invalid DMA direction\n");
>235  return -EINVAL;
>236  }
>237
>238  old_dma_mask = dma_get_mask(map->dev);
>239
>240  ret = dma_set_mask(map->dev,
>241  DMA_BIT_MASK(map->bparam.dma_bits));
>^^
> If this is more than 31 then the behavior is undefined (but in real life
> it will shift wrap).

Guess it should be less than 64?
For 64, it would be ~0ULL; otherwise, it will be (1ULL << n) - 1.
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=7679325702

I have some code like:
+   /* suppose the mininum DMA zone is 1MB in the world */
+   if (bits < 20 || bits > 64) {
+   fprintf(stderr, "invalid dma mask bit, must be in 20-64\n");
+   exit(1);
+   }

Maybe I should do the same thing in kernel as well.
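
A small userspace sketch of why the range check matters. The helper
toy_dma_bit_mask() mirrors the shape of the kernel's DMA_BIT_MASK()
macro, and toy_dma_bits_valid() mirrors the 20-64 check quoted above;
both names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Same shape as DMA_BIT_MASK(): 64 is special-cased because shifting
 * a 64-bit value by 64 or more is undefined behaviour in C.
 * The caller must validate bits first; values above 64 are not handled.
 */
static uint64_t toy_dma_bit_mask(unsigned int bits)
{
	return bits == 64 ? ~UINT64_C(0) : (UINT64_C(1) << bits) - 1;
}

/* Reject user-supplied widths outside the 20-64 range before shifting. */
static bool toy_dma_bits_valid(unsigned int bits)
{
	return bits >= 20 && bits <= 64;
}
```

Calling toy_dma_bits_valid() before toy_dma_bit_mask() keeps a
user-controlled value such as 65 or 200 from ever reaching the shift.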

> 
>242  if (ret) {
>243  pr_err("failed to set dma_mask on device %s\n",
>244  dev_name(map->dev));
>245  return -EINVAL;
>246  }
>247
>248  ret = do_map_benchmark(map);
>249
>250  /*
>251   * restore the original dma_mask as many devices' dma_mask are
>252   * set by architectures, acpi, busses. When we bind them back
>253   * to their original drivers, those drivers shouldn't see
>254   * dma_mask changed by benchmark
>255   */
>256  dma_set_mask(map->dev, old_dma_mask);
>257  break;
> 

RE: [PATCH] dma-mapping: Fix sizeof() mismatch on tsk allocation

2020-11-25 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Colin King [mailto:colin.k...@canonical.com]
> Sent: Thursday, November 26, 2020 3:05 AM
> To: Song Bao Hua (Barry Song) ; Christoph
> Hellwig ; Marek Szyprowski ;
> Robin Murphy ; iommu@lists.linux-foundation.org
> Cc: kernel-janit...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] dma-mapping: Fix sizeof() mismatch on tsk allocation
> 
> From: Colin Ian King 
> 
> An incorrect sizeof() is being used, sizeof(tsk) is not correct, it should
> be sizeof(*tsk). Fix it.
> 
> Addresses-Coverity: ("Sizeof not portable (SIZEOF_MISMATCH)")
> Fixes: bfd2defed94d ("dma-mapping: add benchmark support for streaming
> DMA APIs")
> Signed-off-by: Colin Ian King 
> ---
>  kernel/dma/map_benchmark.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index e1e37603d01b..b1496e744c68 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -121,7 +121,7 @@ static int do_map_benchmark(struct
> map_benchmark_data *map)
>   int ret = 0;
>   int i;
> 
> - tsk = kmalloc_array(threads, sizeof(tsk), GFP_KERNEL);
> +     tsk = kmalloc_array(threads, sizeof(*tsk), GFP_KERNEL);

The size is the same, but the change is correct.
Acked-by: Barry Song 
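
To illustrate why the size is the same: tsk is a pointer to pointers,
so both sizeof(tsk) and sizeof(*tsk) are pointer sizes. A minimal
userspace sketch, where struct toy_task is a hypothetical opaque
stand-in for struct task_struct, and which assumes a platform where
all data pointers have the same size (true on the usual Linux
targets, though not guaranteed by the C standard):

```c
#include <stddef.h>

struct toy_task;	/* opaque, hypothetical stand-in for task_struct */

/* Element size as written before the fix: sizeof(tsk). */
static size_t elem_size_of_tsk(void)
{
	struct toy_task **tsk = NULL;

	return sizeof(tsk);	/* size of a pointer to a pointer */
}

/* Element size as written after the fix: sizeof(*tsk). */
static size_t elem_size_of_deref_tsk(void)
{
	struct toy_task **tsk = NULL;

	return sizeof(*tsk);	/* size of one array element */
}
```

sizeof does not evaluate its operand, so the NULL pointer is never
dereferenced. The second form is the correct idiom because it keeps
tracking the element type if the declaration of tsk ever changes.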

>   if (!tsk)
>   return -ENOMEM;
> 
Thanks
Barry




RE: [PATCH] dma-mapping: fix an uninitialized pointer read due to typo in argp assignment

2020-11-25 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Colin King [mailto:colin.k...@canonical.com]
> Sent: Thursday, November 26, 2020 2:56 AM
> To: Song Bao Hua (Barry Song) ; Christoph
> Hellwig ; Marek Szyprowski ;
> Robin Murphy ; iommu@lists.linux-foundation.org
> Cc: kernel-janit...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] dma-mapping: fix an uninitialized pointer read due to typo in
> argp assignment
> 
> From: Colin Ian King 
> 
> The assignment of argp is currently using argp as the source because of
> a typo. Fix this by assigning it the value passed in arg instead.
> 
> Addresses-Coverity: ("Uninitialized pointer read")
> Fixes: bfd2defed94d ("dma-mapping: add benchmark support for streaming
> DMA APIs")
> Signed-off-by: Colin Ian King 

Acked-by: Barry Song 

> ---
>  kernel/dma/map_benchmark.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index ca616b664f72..e1e37603d01b 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -192,7 +192,7 @@ static long map_benchmark_ioctl(struct file *file,
> unsigned int cmd,
>   unsigned long arg)
>  {
>   struct map_benchmark_data *map = file->private_data;
> - void __user *argp = (void __user *)argp;
> + void __user *argp = (void __user *)arg;
>   u64 old_dma_mask;
> 
>   int ret;
> --
> 2.29.2

Thanks
Barry



[PATCH v4 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-15 Thread Barry Song
Nowadays, there are increasing requirements to benchmark the performance
of dma_map and dma_unmap, particularly while the device is attached to an
IOMMU.

This patch enables the support. Users can run a specified number of threads
to do dma_map_page and dma_unmap_page on a specific NUMA node for a
specified duration. Then dma_map_benchmark will calculate the average
latency for map and unmap.

A difficulty for this benchmark is that dma_map/unmap APIs must run on
a particular device. Each device might have a different backend: IOMMU or
non-IOMMU.

So we use the driver_override to bind dma_map_benchmark to a particular
device by:
For platform devices:
echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind

For PCI devices:
echo dma_map_benchmark > /sys/bus/pci/devices/:00:01.0/driver_override
echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind
echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind

Cc: Will Deacon 
Cc: Shuah Khan 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Signed-off-by: Barry Song 
---
-v4:
  * add dma direction support according to Christoph Hellwig's comment;
  * add dma mask bit set according to Christoph Hellwig's comment;
  * make the benchmark depend on DEBUG_FS according to John Garry's comment;
  * strictly check parameters in ioctl;
  * fixed more than 80 char in one line;

 kernel/dma/Kconfig |   9 +
 kernel/dma/Makefile|   1 +
 kernel/dma/map_benchmark.c | 361 +
 3 files changed, 371 insertions(+)
 create mode 100644 kernel/dma/map_benchmark.c

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index c99de4a21458..07f30651b83d 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -225,3 +225,12 @@ config DMA_API_DEBUG_SG
  is technically out-of-spec.
 
  If unsure, say N.
+
+config DMA_MAP_BENCHMARK
+   bool "Enable benchmarking of streaming DMA mapping"
+   depends on DEBUG_FS
+   help
+ Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
+ performance of dma_(un)map_page.
+
+ See tools/testing/selftests/dma/dma_map_benchmark.c
diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
index dc755ab68aab..7aa6b26b1348 100644
--- a/kernel/dma/Makefile
+++ b/kernel/dma/Makefile
@@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG)   += debug.o
 obj-$(CONFIG_SWIOTLB)  += swiotlb.o
 obj-$(CONFIG_DMA_COHERENT_POOL)+= pool.o
 obj-$(CONFIG_DMA_REMAP)+= remap.o
+obj-$(CONFIG_DMA_MAP_BENCHMARK)+= map_benchmark.o
diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
new file mode 100644
index ..41d44a75adb2
--- /dev/null
+++ b/kernel/dma/map_benchmark.c
@@ -0,0 +1,361 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Hisilicon Limited.
+ */
+
+#define pr_fmt(fmt)    KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
+#define DMA_MAP_MAX_THREADS 1024
+#define DMA_MAP_MAX_SECONDS 300
+
+#define DMA_MAP_BIDIRECTIONAL  0
+#define DMA_MAP_TO_DEVICE  1
+#define DMA_MAP_FROM_DEVICE 2
+
+struct map_benchmark {
+   __u64 avg_map_100ns; /* average map latency in 100ns */
+   __u64 map_stddev; /* standard deviation of map latency */
+   __u64 avg_unmap_100ns; /* as above */
+   __u64 unmap_stddev;
+   __u32 threads; /* how many threads will do map/unmap in parallel */
+   __u32 seconds; /* how long the test will last */
+   __s32 node; /* which numa node this benchmark will run on */
+   __u32 dma_bits; /* DMA addressing capability */
+   __u32 dma_dir; /* DMA data direction */
+   __u64 expansion[10];/* For future use */
+};
+
+struct map_benchmark_data {
+   struct map_benchmark bparam;
+   struct device *dev;
+   struct dentry  *debugfs;
+   enum dma_data_direction dir;
+   atomic64_t sum_map_100ns;
+   atomic64_t sum_unmap_100ns;
+   atomic64_t sum_sq_map;
+   atomic64_t sum_sq_unmap;
+   atomic64_t loops;
+};
+
+static int map_benchmark_thread(void *data)
+{
+   void *buf;
+   dma_addr_t dma_addr;
+   struct map_benchmark_data *map = data;
+   int ret = 0;
+
+   buf = (void *)__get_free_page(GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   while (!kthread_should_stop())  {
+   u64 map_100ns, unmap_100ns, map_sq, unmap_sq;
+   ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
+   ktime_t map_delta, unmap_delta;
+
+   /*
+* for a non-coherent device, if we don't stain them in the
+ 

[PATCH v4 0/2] dma-mapping: provide a benchmark for streaming DMA mapping

2020-11-15 Thread Barry Song
Nowadays, there are increasing requirements to benchmark the performance
of dma_map and dma_unmap, particularly while the device is attached to an
IOMMU.

This patchset provides the benchmark infrastructure for streaming DMA
mapping. The architecture of the code is very similar to the GUP
benchmark:
* mm/gup_benchmark.c provides kernel interface;
* tools/testing/selftests/vm/gup_benchmark.c provides user program to
call the interface provided by mm/gup_benchmark.c.

In our case, kernel/dma/map_benchmark.c is like mm/gup_benchmark.c;
tools/testing/selftests/dma/dma_map_benchmark.c is like tools/testing/
selftests/vm/gup_benchmark.c

A major difference with GUP benchmark is DMA_MAP benchmark needs to run
on a device. Considering one board with below devices and IOMMUs
device A  --- IOMMU 1
device B  --- IOMMU 2
device C  --- non-IOMMU

Different devices might attach to different IOMMUs or to no IOMMU. To make
the benchmark run, we can either
* create a virtual device and hack the kernel code to attach the virtual
device to IOMMU1, IOMMU2 or non-IOMMU.
* use the existing driver_override mechanism, unbind device A, B or C from
its original driver and bind it to the dma_map_benchmark platform driver or
pci driver for benchmarking.

In this patchset, I prefer to use the driver_override and avoid the ugly
hack in kernel. We can dynamically switch between devices behind different
IOMMUs to get the performance of IOMMU or non-IOMMU.

-v4:
  * add dma direction support according to Christoph Hellwig's comment;
  * add dma mask bit set according to Christoph Hellwig's comment;
  * make the benchmark depend on DEBUG_FS according to John Garry's comment;
  * strictly check parameters in ioctl
-v3:
  * fix build issues reported by 0day kernel test robot
-v2:
  * add PCI support; v1 supported platform devices only
  * replace ssleep by msleep_interruptible() to permit users to exit
benchmark before it is completed
  * many changes according to Robin's suggestions, thanks! Robin
- add standard deviation output to reflect the worst case
- check users' parameters strictly like the number of threads
- make cache dirty before dma_map
- fix unpaired dma_map_page and dma_unmap_single;
- remove redundant "long long" before ktime_to_ns();
- use devm_add_action()

Barry Song (2):
  dma-mapping: add benchmark support for streaming DMA APIs
  selftests/dma: add test application for DMA_MAP_BENCHMARK

 MAINTAINERS   |   6 +
 kernel/dma/Kconfig|   9 +
 kernel/dma/Makefile   |   1 +
 kernel/dma/map_benchmark.c| 361 ++
 tools/testing/selftests/dma/Makefile  |   6 +
 tools/testing/selftests/dma/config|   1 +
 .../testing/selftests/dma/dma_map_benchmark.c | 123 ++
 7 files changed, 507 insertions(+)
 create mode 100644 kernel/dma/map_benchmark.c
 create mode 100644 tools/testing/selftests/dma/Makefile
 create mode 100644 tools/testing/selftests/dma/config
 create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c

-- 
2.25.1



[PATCH v4 2/2] selftests/dma: add test application for DMA_MAP_BENCHMARK

2020-11-15 Thread Barry Song
This patch provides the test application for DMA_MAP_BENCHMARK.

Before running the test application, we need to bind a device to dma_map_
benchmark driver. For example, unbind "xxx" from its original driver and
bind to dma_map_benchmark:

echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind

Another example for PCI devices:
echo dma_map_benchmark > /sys/bus/pci/devices/:00:01.0/driver_override
echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind
echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind

The below command will run 16 threads on numa node 0 for 10 seconds on
the device bound to dma_map_benchmark platform_driver or pci_driver:
./dma_map_benchmark -t 16 -s 10 -n 0
dma mapping benchmark: threads:16 seconds:10
average map latency(us):1.1 standard deviation:1.9
average unmap latency(us):0.5 standard deviation:0.8

Cc: Will Deacon 
Cc: Shuah Khan 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Signed-off-by: Barry Song 
---
 -v4:
  * add dma direction and mask_bit parameters

 MAINTAINERS   |   6 +
 tools/testing/selftests/dma/Makefile  |   6 +
 tools/testing/selftests/dma/config|   1 +
 .../testing/selftests/dma/dma_map_benchmark.c | 123 ++
 4 files changed, 136 insertions(+)
 create mode 100644 tools/testing/selftests/dma/Makefile
 create mode 100644 tools/testing/selftests/dma/config
 create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c

diff --git a/MAINTAINERS b/MAINTAINERS
index e451dcce054f..bc851ffd3114 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5247,6 +5247,12 @@ F:   include/linux/dma-mapping.h
 F: include/linux/dma-map-ops.h
 F: kernel/dma/
 
+DMA MAPPING BENCHMARK
+M: Barry Song 
+L: iommu@lists.linux-foundation.org
+F: kernel/dma/map_benchmark.c
+F: tools/testing/selftests/dma/
+
 DMA-BUF HEAPS FRAMEWORK
 M: Sumit Semwal 
 R: Benjamin Gaignard 
diff --git a/tools/testing/selftests/dma/Makefile 
b/tools/testing/selftests/dma/Makefile
new file mode 100644
index ..aa8e8b5b3864
--- /dev/null
+++ b/tools/testing/selftests/dma/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -I../../../../usr/include/
+
+TEST_GEN_PROGS := dma_map_benchmark
+
+include ../lib.mk
diff --git a/tools/testing/selftests/dma/config 
b/tools/testing/selftests/dma/config
new file mode 100644
index ..6102ee3c43cd
--- /dev/null
+++ b/tools/testing/selftests/dma/config
@@ -0,0 +1 @@
+CONFIG_DMA_MAP_BENCHMARK=y
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c 
b/tools/testing/selftests/dma/dma_map_benchmark.c
new file mode 100644
index ..7065163a8388
--- /dev/null
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -0,0 +1,123 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Hisilicon Limited.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
+#define DMA_MAP_MAX_THREADS 1024
+#define DMA_MAP_MAX_SECONDS 300
+
+#define DMA_MAP_BIDIRECTIONAL  0
+#define DMA_MAP_TO_DEVICE  1
+#define DMA_MAP_FROM_DEVICE 2
+
+static char *directions[] = {
+   "BIDIRECTIONAL",
+   "TO_DEVICE",
+   "FROM_DEVICE",
+};
+
+struct map_benchmark {
+   __u64 avg_map_100ns; /* average map latency in 100ns */
+   __u64 map_stddev; /* standard deviation of map latency */
+   __u64 avg_unmap_100ns; /* as above */
+   __u64 unmap_stddev;
+   __u32 threads; /* how many threads will do map/unmap in parallel */
+   __u32 seconds; /* how long the test will last */
+   __s32 node; /* which numa node this benchmark will run on */
+   __u32 dma_bits; /* DMA addressing capability */
+   __u32 dma_dir; /* DMA data direction */
+   __u64 expansion[10];/* For future use */
+};
+
+int main(int argc, char **argv)
+{
+   struct map_benchmark map;
+   int fd, opt;
+   /* default single thread, run 20 seconds on NUMA_NO_NODE */
+   int threads = 1, seconds = 20, node = -1;
+   /* default dma mask 32bit, bidirectional DMA */
+   int bits = 32, dir = DMA_MAP_BIDIRECTIONAL;
+
+   int cmd = DMA_MAP_BENCHMARK;
+   char *p;
+
+   while ((opt = getopt(argc, argv, "t:s:n:b:d:")) != -1) {
+   switch (opt) {
+   case 't':
+   threads = atoi(optarg);
+   break;
+   case 's':
+   seconds = atoi(optarg);
+   break;
+   case 'n':
+   node = atoi(optarg);
+   break;
+   case 'b':
+   bits = atoi(optarg);
+   break;

RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-15 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Sunday, November 15, 2020 9:45 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Christoph Hellwig ; iommu@lists.linux-foundation.org;
> robin.mur...@arm.com; m.szyprow...@samsung.com; Linuxarm
> ; linux-kselft...@vger.kernel.org; xuwei (O)
> ; Joerg Roedel ; Will Deacon
> ; Shuah Khan 
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> On Sun, Nov 15, 2020 at 12:11:15AM +, Song Bao Hua (Barry Song)
> wrote:
> >
> > Checkpatch has changed 80 to 100. That's probably why my local checkpatch
> > didn't report any warning:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bdc48fa11e46f867ea4d
> >
> > I am happy to change them to be less than 80 if you like.
> 
> Don't rely on checkpatch, it is broken.  Look at the coding style document.
> 
> > > I think this needs to set a dma mask as behavior for unlimited dma
> > > mask vs the default 32-bit one can be very different.
> >
> > I actually prefer users bind real devices with real dma_mask to test rather
> > than force to change the dma_mask in this benchmark.
> 
> The mask is set by the driver, not the device.  So you need to set it
> when you bind, real device or not.

Yep, though it is a little bit tricky.

Sometimes, it is done by the "device" in architecture code, e.g. there is a lot of
dma_mask configuration code in arch/arm/mach-xxx:
arch/arm/mach-davinci/da850.c
static u64 da850_vpif_dma_mask = DMA_BIT_MASK(32);
static struct platform_device da850_vpif_dev = {
        .name           = "vpif",
        .id             = -1,
        .dev            = {
                .dma_mask               = &da850_vpif_dma_mask,
                .coherent_dma_mask      = DMA_BIT_MASK(32),
        },
        .resource       = da850_vpif_resource,
        .num_resources  = ARRAY_SIZE(da850_vpif_resource),
};

Sometimes, it is done by "of" or "acpi", for example:
drivers/acpi/arm64/iort.c
void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size)
{
        u64 end, mask, dmaaddr = 0, size = 0, offset = 0;
        int ret;

        ...

        ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
        if (!ret) {
                /*
                 * Limit coherent and dma mask based on size retrieved from
                 * firmware.
                 */
                end = dmaaddr + size - 1;
                mask = DMA_BIT_MASK(ilog2(end) + 1);
                dev->bus_dma_limit = end;
                dev->coherent_dma_mask = mask;
                *dev->dma_mask = mask;
        }
        ...
}
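
To make the mask arithmetic in that excerpt concrete, here is a small userspace sketch. DMA_BIT_MASK() below is a copy of the kernel macro; ilog2_u64() and mask_from_range() are hypothetical stand-ins written only for this illustration, not kernel APIs.

```c
#include <stdint.h>

/* Userspace copy of the kernel's DMA_BIT_MASK(): the low n bits set. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical stand-in for ilog2(): index of the highest set bit (v != 0). */
unsigned int ilog2_u64(uint64_t v)
{
	unsigned int bit = 0;

	while (v >>= 1)
		bit++;
	return bit;
}

/* Same computation as the excerpt: derive a mask from a base and a size. */
uint64_t mask_from_range(uint64_t dmaaddr, uint64_t size)
{
	uint64_t end = dmaaddr + size - 1;

	return DMA_BIT_MASK(ilog2_u64(end) + 1);
}
```

So a firmware range of 4GB starting at 0 ends at 0xffffffff, ilog2() gives 31, and the resulting mask is DMA_BIT_MASK(32).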

Sometimes, it is done by "bus", for example, ISA:
        isa_dev->dev.coherent_dma_mask = DMA_BIT_MASK(24);
        isa_dev->dev.dma_mask = &isa_dev->dev.coherent_dma_mask;

        error = device_register(&isa_dev->dev);
        if (error) {
                put_device(&isa_dev->dev);
                break;
        }

And in many cases, it is done by the driver. On the ARM64 server platform I am
testing, drivers actually rarely set dma_mask.

So to make the dma benchmark work on all platforms, it seems worth
adding a dma_mask_bit parameter. But, in order to avoid breaking the
dma_mask of those devices whose dma_mask is set by the architecture,
ACPI or the bus, it seems we need to do the below in dma_benchmark:

u64 old_mask;

old_mask = dma_get_mask(dev);

dma_set_mask(dev, DMA_BIT_MASK(dma_mask_bit));

do_map_benchmark();

/* restore old dma_mask so that the dma_mask of the device is not changed due to
 * benchmark when it is bound back to its original driver */
dma_set_mask(dev, old_mask);

Thanks
Barry



RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-14 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Sunday, November 15, 2020 5:54 AM
> To: Song Bao Hua (Barry Song) 
> Cc: iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com;
> m.szyprow...@samsung.com; Linuxarm ;
> linux-kselft...@vger.kernel.org; xuwei (O) ; Joerg
> Roedel ; Will Deacon ; Shuah Khan
> 
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> Lots of > 80 char lines.  Please fix up the style.

Checkpatch has changed 80 to 100. That's probably why my local checkpatch 
didn't report any warning:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bdc48fa11e46f867ea4d

I am happy to change them to be less than 80 if you like.

> 
> I think this needs to set a dma mask as behavior for unlimited dma
> mask vs the default 32-bit one can be very different. 

I actually prefer users bind real devices with real dma_mask to test rather
than forcing a change of the dma_mask in this benchmark.

Some devices might have a 32-bit dma_mask while others might have an unlimited
one. But both kinds can bind to this driver and unbind from it after the test is
done. So users just need to bind those different real devices, with their
different real dma_masks, to dma_benchmark.

This can reflect the real performance of the real device better, I think.

> I also think you need to be able to pass the direction or have different tests
> for directions.  bidirectional is not exactly heavily used and pays
> more cache management penalty.

For this, I'd like to add a direction option to the test app and pass the
option to the benchmark driver.
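
Something along these lines on the userspace side, as a rough sketch only. The three constants and the name table follow the later revision of the selftest; dma_dir_name() is a hypothetical helper for validating the option, it is not in any posted patch.

```c
#include <stddef.h>

#define DMA_MAP_BIDIRECTIONAL 0
#define DMA_MAP_TO_DEVICE     1
#define DMA_MAP_FROM_DEVICE   2

static const char *const directions[] = {
	"BIDIRECTIONAL",
	"TO_DEVICE",
	"FROM_DEVICE",
};

/* Hypothetical helper: name of a direction value, or NULL if out of range. */
const char *dma_dir_name(int dir)
{
	if (dir < DMA_MAP_BIDIRECTIONAL || dir > DMA_MAP_FROM_DEVICE)
		return NULL;
	return directions[dir];
}
```

The test app could reject the option when the helper returns NULL and otherwise copy the value into the ioctl structure.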

Thanks
Barry



RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-11 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: John Garry
> Sent: Wednesday, November 11, 2020 10:37 PM
> To: Song Bao Hua (Barry Song) ;
> iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com;
> m.szyprow...@samsung.com
> Cc: linux-kselft...@vger.kernel.org; Will Deacon ; Joerg
> Roedel ; Linuxarm ; xuwei (O)
> ; Shuah Khan 
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> On 11/11/2020 01:29, Song Bao Hua (Barry Song) wrote:
> > I'd like to think checking this here would be overdesign. We just give users
> > the freedom to bind any device they care about to the benchmark driver. Usually
> > that means a real hardware either behind an IOMMU or through a direct
> > mapping.
> >
> > if for any reason users put a wrong "device", that is the choice of users.
> 
> Right, but if the device simply has no DMA ops supported, it could be
> better to fail the probe rather than let them try the test at all.
> 
> > Anyhow,
> > the below code will still handle it properly and users will get a report in
> > which everything is zero.
> >
> > +static int map_benchmark_thread(void *data)
> > +{
> > ...
> > +   dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, DMA_BIDIRECTIONAL);
> > +   if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
> 
> Doing this is proper, but I am not sure if this tells the user the real
> problem.

Telling users the real problem isn't the design intention of this
benchmark; that was never its purpose.

> 
> > +   pr_err("dma_map_single failed on %s\n", dev_name(map->dev));
> 
> Not sure why use pr_err() over dev_err().

We are reporting errors in dma-benchmark driver rather than reporting errors
in the driver of the specific device. I think we should have "dma-benchmark"
as the prefix while printing the name of the device by dev_name().

> 
> > +   ret = -ENOMEM;
> > +   goto out;
> > +   }
> 
> Thanks,
> John

Thanks
Barry



RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-10 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: John Garry
> Sent: Tuesday, November 10, 2020 9:39 PM
> To: Song Bao Hua (Barry Song) ;
> iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com;
> m.szyprow...@samsung.com
> Cc: linux-kselft...@vger.kernel.org; Will Deacon ; Joerg
> Roedel ; Linuxarm ; xuwei (O)
> ; Shuah Khan 
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> On 10/11/2020 08:10, Song Bao Hua (Barry Song) wrote:
> > Hello Robin, Christoph,
> > Any further comment? John suggested that "depends on DEBUG_FS" should
> > be added in Kconfig.
> > I am collecting more comments to send v4 together with fixing this minor
> > issue :-)
> >
> > Thanks
> > Barry
> >
> >> -Original Message-
> >> From: Song Bao Hua (Barry Song)
> >> Sent: Monday, November 2, 2020 9:07 PM
> >> To: iommu@lists.linux-foundation.org; h...@lst.de;
> robin.mur...@arm.com;
> >> m.szyprow...@samsung.com
> >> Cc: Linuxarm ; linux-kselft...@vger.kernel.org;
> xuwei
> >> (O) ; Song Bao Hua (Barry Song)
> >> ; Joerg Roedel ; Will
> Deacon
> >> ; Shuah Khan 
> >> Subject: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming
> >> DMA APIs
> >>
> >> Nowadays, there are increasing requirements to benchmark the performance
> >> of dma_map and dma_unmap particularly while the device is attached to an
> >> IOMMU.
> >>
> >> This patch enables the support. Users can run specified number of threads to
> >> do dma_map_page and dma_unmap_page on a specific NUMA node with the
> >> specified duration. Then dma_map_benchmark will calculate the average
> >> latency for map and unmap.
> >>
> >> A difficulty for this benchmark is that dma_map/unmap APIs must run on a
> >> particular device. Each device might have different backend of IOMMU or
> >> non-IOMMU.
> >>
> >> So we use the driver_override to bind dma_map_benchmark to a particular
> >> device by:
> >> For platform devices:
> >> echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
> >> echo xxx > /sys/bus/platform/drivers/xxx/unbind
> >> echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> >>
> 
> Hi Barry,
> 
> >> For PCI devices:
> >> echo dma_map_benchmark > /sys/bus/pci/devices/:00:01.0/driver_override
> >> echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind
> >> echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
> 
> Do we need to check if the device to which we attach actually has DMA
> mapping capability?

Hello John,

I'd like to think checking this here would be overdesign. We just give users the
freedom to bind any device they care about to the benchmark driver. Usually
that means real hardware, either behind an IOMMU or using a direct
mapping.

If for any reason users bind a wrong "device", that is their choice.
Anyhow, the below code will still handle it properly and users will get a report
in which everything is zero.

+static int map_benchmark_thread(void *data)
+{
...
+   dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, DMA_BIDIRECTIONAL);
+   if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
+           pr_err("dma_map_single failed on %s\n", dev_name(map->dev));
+           ret = -ENOMEM;
+           goto out;
+   }
...
+}

> 
> >>
> >> Cc: Joerg Roedel 
> >> Cc: Will Deacon 
> >> Cc: Shuah Khan 
> >> Cc: Christoph Hellwig 
> >> Cc: Marek Szyprowski 
> >> Cc: Robin Murphy 
> >> Signed-off-by: Barry Song 
> >> ---
> 
> Thanks,
> John

Thanks
Barry



RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-10 Thread Song Bao Hua (Barry Song)
Hello Robin, Christoph,
Any further comments? John suggested that "depends on DEBUG_FS" should be added
in Kconfig.
I am collecting more comments so that I can send v4 together with a fix for this
minor issue :-)

Thanks
Barry

> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Monday, November 2, 2020 9:07 PM
> To: iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com;
> m.szyprow...@samsung.com
> Cc: Linuxarm ; linux-kselft...@vger.kernel.org; xuwei
> (O) ; Song Bao Hua (Barry Song)
> ; Joerg Roedel ; Will Deacon
> ; Shuah Khan 
> Subject: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
> 
> Nowadays, there are increasing requirements to benchmark the performance
> of dma_map and dma_unmap particularly while the device is attached to an
> IOMMU.
> 
> This patch enables the support. Users can run specified number of threads to
> do dma_map_page and dma_unmap_page on a specific NUMA node with the
> specified duration. Then dma_map_benchmark will calculate the average
> latency for map and unmap.
> 
> A difficulty for this benchmark is that dma_map/unmap APIs must run on a
> particular device. Each device might have different backend of IOMMU or
> non-IOMMU.
> 
> So we use the driver_override to bind dma_map_benchmark to a particular
> device by:
> For platform devices:
> echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
> echo xxx > /sys/bus/platform/drivers/xxx/unbind
> echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> 
> For PCI devices:
> echo dma_map_benchmark > /sys/bus/pci/devices/:00:01.0/driver_override
> echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind
> echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
> 
> Cc: Joerg Roedel 
> Cc: Will Deacon 
> Cc: Shuah Khan 
> Cc: Christoph Hellwig 
> Cc: Marek Szyprowski 
> Cc: Robin Murphy 
> Signed-off-by: Barry Song 
> ---
> -v3:
>   * fix build issues reported by 0day kernel test robot
> -v2:
>   * add PCI support; v1 supported platform devices only
>   * replace ssleep by msleep_interruptible() to permit users to exit
> benchmark before it is completed
>   * many changes according to Robin's suggestions, thanks! Robin
> - add standard deviation output to reflect the worst case
> - check users' parameters strictly like the number of threads
> - make cache dirty before dma_map
> - fix unpaired dma_map_page and dma_unmap_single;
> - remove redundant "long long" before ktime_to_ns();
> - use devm_add_action()
> 
>  kernel/dma/Kconfig |   8 +
>  kernel/dma/Makefile|   1 +
>  kernel/dma/map_benchmark.c | 296
> +
>  3 files changed, 305 insertions(+)
>  create mode 100644 kernel/dma/map_benchmark.c
> 
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index c99de4a21458..949c53da5991 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
> is technically out-of-spec.
> 
> If unsure, say N.
> +
> +config DMA_MAP_BENCHMARK
> + bool "Enable benchmarking of streaming DMA mapping"
> + help
> +   Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
> +   performance of dma_(un)map_page.
> +
> +   See tools/testing/selftests/dma/dma_map_benchmark.c
> diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
> index dc755ab68aab..7aa6b26b1348 100644
> --- a/kernel/dma/Makefile
> +++ b/kernel/dma/Makefile
> @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG) += debug.o
>  obj-$(CONFIG_SWIOTLB) += swiotlb.o
>  obj-$(CONFIG_DMA_COHERENT_POOL) += pool.o
>  obj-$(CONFIG_DMA_REMAP) += remap.o
> +obj-$(CONFIG_DMA_MAP_BENCHMARK) += map_benchmark.o
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> new file mode 100644
> index ..dc4e5ff48a2d
> --- /dev/null
> +++ b/kernel/dma/map_benchmark.c
> @@ -0,0 +1,296 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 Hisilicon Limited.
> + */
> +
> +#define pr_fmt(fmt)  KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
> +#define DMA_MAP_MAX_THREADS  1024
> +#define DMA_MAP_MAX_SECONDS  300
> +
> +struct map_benchmark {
> + __u64 avg_map_100ns; /* average map latency in 100ns */

RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-02 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: John Garry
> Sent: Monday, November 2, 2020 10:19 PM
> To: Song Bao Hua (Barry Song) ;
> iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com;
> m.szyprow...@samsung.com
> Cc: linux-kselft...@vger.kernel.org; Shuah Khan ; Joerg
> Roedel ; Linuxarm ; xuwei (O)
> ; Will Deacon 
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> On 02/11/2020 08:06, Barry Song wrote:
> > Nowadays, there are increasing requirements to benchmark the performance
> > of dma_map and dma_unmap particularly while the device is attached to an
> > IOMMU.
> >
> > This patch enables the support. Users can run specified number of threads
> > to do dma_map_page and dma_unmap_page on a specific NUMA node with the
> > specified duration. Then dma_map_benchmark will calculate the average
> > latency for map and unmap.
> >
> > A difficulty for this benchmark is that dma_map/unmap APIs must run on
> > a particular device. Each device might have different backend of IOMMU or
> > non-IOMMU.
> >
> > So we use the driver_override to bind dma_map_benchmark to a particular
> > device by:
> > For platform devices:
> > echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
> > echo xxx > /sys/bus/platform/drivers/xxx/unbind
> > echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> >
> > For PCI devices:
> > echo dma_map_benchmark > /sys/bus/pci/devices/:00:01.0/driver_override
> > echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind
> > echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
> >
> > Cc: Joerg Roedel 
> > Cc: Will Deacon 
> > Cc: Shuah Khan 
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Robin Murphy 
> > Signed-off-by: Barry Song 
> > ---
> > -v3:
> >* fix build issues reported by 0day kernel test robot
> > -v2:
> >* add PCI support; v1 supported platform devices only
> >* replace ssleep by msleep_interruptible() to permit users to exit
> >  benchmark before it is completed
> >* many changes according to Robin's suggestions, thanks! Robin
> >  - add standard deviation output to reflect the worst case
> >  - check users' parameters strictly like the number of threads
> >  - make cache dirty before dma_map
> >  - fix unpaired dma_map_page and dma_unmap_single;
> >  - remove redundant "long long" before ktime_to_ns();
> >  - use devm_add_action()
> >
> >   kernel/dma/Kconfig |   8 +
> >   kernel/dma/Makefile|   1 +
> >   kernel/dma/map_benchmark.c | 296
> +
> >   3 files changed, 305 insertions(+)
> >   create mode 100644 kernel/dma/map_benchmark.c
> >
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index c99de4a21458..949c53da5991 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
> >   is technically out-of-spec.
> >
> >   If unsure, say N.
> > +
> > +config DMA_MAP_BENCHMARK
> > +   bool "Enable benchmarking of streaming DMA mapping"
> > +   help
> > + Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
> > + performance of dma_(un)map_page.
> 
> Since this is a driver, any reason for which it cannot be loadable? If
> so, it seems any functionality would depend on DEBUG FS, I figure that's
> just how we work for debugfs.

We depend on kthread_bind_mask(), which isn't exported with EXPORT_SYMBOL.
Maybe it is worth sending a patch to export it?

> 
> Thanks,
> John
> 
> > +
> > + See tools/testing/selftests/dma/dma_map_benchmark.c
> > diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
> > index dc755ab68aab..7aa6b26b1348 100644
> > --- a/kernel/dma/Makefile
> > +++ b/kernel/dma/Makefile

Thanks
Barry



[PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-02 Thread Barry Song
Nowadays, there are increasing requirements to benchmark the performance
of dma_map and dma_unmap particularly while the device is attached to an
IOMMU.

This patch enables the support. Users can run specified number of threads
to do dma_map_page and dma_unmap_page on a specific NUMA node with the
specified duration. Then dma_map_benchmark will calculate the average
latency for map and unmap.
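
As a rough illustration of how an average and a standard deviation can be derived from running sums (the benchmark keeps sum_map_100ns, sum_square_map and loops counters for this purpose), here is a userspace sketch. isqrt_u64() and summarize() are hypothetical names used only here; the kernel side would use its own division and square-root helpers.

```c
#include <stdint.h>

struct latency_stats {
	uint64_t avg;     /* mean latency, same unit as the samples */
	uint64_t stddev;  /* population standard deviation, same unit */
};

/* Integer square root (floor), bit-by-bit method. */
uint64_t isqrt_u64(uint64_t x)
{
	uint64_t r = 0, bit = 1ULL << 62;

	while (bit > x)
		bit >>= 2;
	while (bit) {
		if (x >= r + bit) {
			x -= r + bit;
			r = (r >> 1) + bit;
		} else {
			r >>= 1;
		}
		bit >>= 2;
	}
	return r;
}

/* variance = E[x^2] - (E[x])^2, computed from the accumulated sums. */
struct latency_stats summarize(uint64_t sum, uint64_t sum_sq, uint64_t loops)
{
	struct latency_stats s = { 0, 0 };
	uint64_t mean_sq;

	if (!loops)
		return s;
	s.avg = sum / loops;
	mean_sq = sum_sq / loops;
	/* defensive guard against unsigned wraparound */
	if (mean_sq > s.avg * s.avg)
		s.stddev = isqrt_u64(mean_sq - s.avg * s.avg);
	return s;
}
```

For example, the samples 2, 4, 4, 4, 5, 5, 7, 9 give sum 40 and sum of squares 232 over 8 loops, so the average is 5 and the standard deviation is 2.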

A difficulty for this benchmark is that dma_map/unmap APIs must run on
a particular device. Each device might have different backend of IOMMU or
non-IOMMU.

So we use the driver_override to bind dma_map_benchmark to a particular
device by:
For platform devices:
echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind

For PCI devices:
echo dma_map_benchmark > /sys/bus/pci/devices/:00:01.0/driver_override
echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind
echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Shuah Khan 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Signed-off-by: Barry Song 
---
-v3:
  * fix build issues reported by 0day kernel test robot
-v2:
  * add PCI support; v1 supported platform devices only
  * replace ssleep by msleep_interruptible() to permit users to exit
benchmark before it is completed
  * many changes according to Robin's suggestions, thanks! Robin
- add standard deviation output to reflect the worst case
- check users' parameters strictly like the number of threads
- make cache dirty before dma_map
- fix unpaired dma_map_page and dma_unmap_single;
- remove redundant "long long" before ktime_to_ns();
- use devm_add_action()

 kernel/dma/Kconfig |   8 +
 kernel/dma/Makefile|   1 +
 kernel/dma/map_benchmark.c | 296 +
 3 files changed, 305 insertions(+)
 create mode 100644 kernel/dma/map_benchmark.c

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index c99de4a21458..949c53da5991 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
  is technically out-of-spec.
 
  If unsure, say N.
+
+config DMA_MAP_BENCHMARK
+   bool "Enable benchmarking of streaming DMA mapping"
+   help
+ Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
+ performance of dma_(un)map_page.
+
+ See tools/testing/selftests/dma/dma_map_benchmark.c
diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
index dc755ab68aab..7aa6b26b1348 100644
--- a/kernel/dma/Makefile
+++ b/kernel/dma/Makefile
@@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG)   += debug.o
 obj-$(CONFIG_SWIOTLB)  += swiotlb.o
 obj-$(CONFIG_DMA_COHERENT_POOL) += pool.o
 obj-$(CONFIG_DMA_REMAP) += remap.o
+obj-$(CONFIG_DMA_MAP_BENCHMARK) += map_benchmark.o
diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
new file mode 100644
index ..dc4e5ff48a2d
--- /dev/null
+++ b/kernel/dma/map_benchmark.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Hisilicon Limited.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
+#define DMA_MAP_MAX_THREADS 1024
+#define DMA_MAP_MAX_SECONDS 300
+
+struct map_benchmark {
+   __u64 avg_map_100ns; /* average map latency in 100ns */
+   __u64 map_stddev; /* standard deviation of map latency */
+   __u64 avg_unmap_100ns; /* as above */
+   __u64 unmap_stddev;
+   __u32 threads; /* how many threads will do map/unmap in parallel */
+   __u32 seconds; /* how long the test will last */
+   int node; /* which numa node this benchmark will run on */
+   __u64 expansion[10]; /* For future use */
+};
+
+struct map_benchmark_data {
+   struct map_benchmark bparam;
+   struct device *dev;
+   struct dentry  *debugfs;
+   atomic64_t sum_map_100ns;
+   atomic64_t sum_unmap_100ns;
+   atomic64_t sum_square_map;
+   atomic64_t sum_square_unmap;
+   atomic64_t loops;
+};
+
+static int map_benchmark_thread(void *data)
+{
+   void *buf;
+   dma_addr_t dma_addr;
+   struct map_benchmark_data *map = data;
+   int ret = 0;
+
+   buf = (void *)__get_free_page(GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   while (!kthread_should_stop())  {
+   __u64 map_100ns, unmap_100ns, map_square, unmap_square;
+   ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
+
+   /*
+* for a non-coher

[PATCH v3 0/2] dma-mapping: provide a benchmark for streaming DMA mapping

2020-11-02 Thread Barry Song
Nowadays, there are increasing requirements to benchmark the performance
of dma_map and dma_unmap, particularly while the device is attached to an
IOMMU.

This patchset provides the benchmark infrastructure for streaming DMA
mapping. The architecture of the code is very similar to the GUP
benchmark:
* mm/gup_benchmark.c provides kernel interface;
* tools/testing/selftests/vm/gup_benchmark.c provides user program to
call the interface provided by mm/gup_benchmark.c.

In our case, kernel/dma/map_benchmark.c is like mm/gup_benchmark.c;
tools/testing/selftests/dma/dma_map_benchmark.c is like
tools/testing/selftests/vm/gup_benchmark.c.

A major difference from the GUP benchmark is that the DMA_MAP benchmark needs
to run on a device. Consider one board with the below devices and IOMMUs:
device A  --- IOMMU 1
device B  --- IOMMU 2
device C  --- non-IOMMU

Different devices might attach to different IOMMU or non-IOMMU. To make
benchmark run, we can either
* create a virtual device and hack the kernel code to attach the virtual
device to IOMMU1, IOMMU2 or non-IOMMU.
* use the existing driver_override mechanism, unbind device A, B, or C from
its original driver and bind it to the dma_map_benchmark platform driver or
pci driver for benchmarking.

In this patchset, I prefer to use the driver_override and avoid the ugly
hack in kernel. We can dynamically switch device behind different IOMMUs
to get the performance of IOMMU or non-IOMMU.
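
For anyone who prefers to script the driver_override sequence from C rather than with echo, a minimal sketch follows. write_attr() and rebind_to_benchmark() are hypothetical helpers, not part of this patchset; sysfs_root would normally be "/sys", bus is "platform" or "pci", and old_drv is whatever driver currently owns the device.

```c
#include <stdio.h>

/* Write a string to a file such as a sysfs attribute; 0 on success, -1 on error. */
int write_attr(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(value, f) == EOF) {
		fclose(f);
		return -1;
	}
	/* sysfs reports most write errors when the buffer is flushed */
	return fclose(f) == 0 ? 0 : -1;
}

/* Set driver_override, unbind from old_drv, then bind to dma_map_benchmark. */
int rebind_to_benchmark(const char *sysfs_root, const char *bus,
			const char *dev, const char *old_drv)
{
	char path[512];
	int n;

	n = snprintf(path, sizeof(path), "%s/bus/%s/devices/%s/driver_override",
		     sysfs_root, bus, dev);
	if (n < 0 || (size_t)n >= sizeof(path) ||
	    write_attr(path, "dma_map_benchmark"))
		return -1;

	n = snprintf(path, sizeof(path), "%s/bus/%s/drivers/%s/unbind",
		     sysfs_root, bus, old_drv);
	if (n < 0 || (size_t)n >= sizeof(path) || write_attr(path, dev))
		return -1;

	n = snprintf(path, sizeof(path), "%s/bus/%s/drivers/dma_map_benchmark/bind",
		     sysfs_root, bus);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return write_attr(path, dev);
}
```

The three writes mirror the three echo commands used for the platform and PCI examples; to return the device to its original driver afterwards, the override would have to be cleared and the device rebound.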

-v3:
  * fix build issues reported by 0day kernel test robot
-v2:
  * add PCI support; v1 supported platform devices only
  * replace ssleep by msleep_interruptible() to permit users to exit
benchmark before it is completed
  * many changes according to Robin's suggestions, thanks! Robin
- add standard deviation output to reflect the worst case
- check users' parameters strictly like the number of threads
- make cache dirty before dma_map
- fix unpaired dma_map_page and dma_unmap_single;
- remove redundant "long long" before ktime_to_ns();
- use devm_add_action()

Barry Song (2):
  dma-mapping: add benchmark support for streaming DMA APIs
  selftests/dma: add test application for DMA_MAP_BENCHMARK

 MAINTAINERS   |   6 +
 kernel/dma/Kconfig|   8 +
 kernel/dma/Makefile   |   1 +
 kernel/dma/map_benchmark.c| 296 ++
 tools/testing/selftests/dma/Makefile  |   6 +
 tools/testing/selftests/dma/config|   1 +
 .../testing/selftests/dma/dma_map_benchmark.c |  87 +
 7 files changed, 405 insertions(+)
 create mode 100644 kernel/dma/map_benchmark.c
 create mode 100644 tools/testing/selftests/dma/Makefile
 create mode 100644 tools/testing/selftests/dma/config
 create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c

-- 
2.25.1



[PATCH v3 2/2] selftests/dma: add test application for DMA_MAP_BENCHMARK

2020-11-02 Thread Barry Song
This patch provides the test application for DMA_MAP_BENCHMARK.

Before running the test application, we need to bind a device to the
dma_map_benchmark driver. For example, unbind "xxx" from its original driver
and bind to dma_map_benchmark:

echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind

Another example for PCI devices:
echo dma_map_benchmark > /sys/bus/pci/devices/:00:01.0/driver_override
echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind
echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind

The below command will run 16 threads on numa node 0 for 10 seconds on
the device bound to dma_map_benchmark platform_driver or pci_driver:
./dma_map_benchmark -t 16 -s 10 -n 0
dma mapping benchmark: threads:16 seconds:10
average map latency(us):1.1 standard deviation:1.9
average unmap latency(us):0.5 standard deviation:0.8
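
The kernel reports latencies in units of 100ns (see avg_map_100ns), while the output above is in microseconds. A sketch of that conversion, assuming the standard deviation uses the same 100ns unit; format_latency() is a hypothetical helper for illustration, not the code in this patch.

```c
#include <stdint.h>
#include <stdio.h>

/* Format one result line; values arrive in 100ns units, output is in us. */
int format_latency(char *buf, size_t len, const char *what,
		   uint64_t avg_100ns, uint64_t stddev_100ns)
{
	return snprintf(buf, len,
			"average %s latency(us):%.1f standard deviation:%.1f",
			what, avg_100ns / 10.0, stddev_100ns / 10.0);
}
```

With avg_100ns = 11 and a deviation of 19 this reproduces the "map" line shown above.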

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Shuah Khan 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Signed-off-by: Barry Song 
---
 MAINTAINERS   |  6 ++
 tools/testing/selftests/dma/Makefile  |  6 ++
 tools/testing/selftests/dma/config|  1 +
 .../testing/selftests/dma/dma_map_benchmark.c | 87 +++
 4 files changed, 100 insertions(+)
 create mode 100644 tools/testing/selftests/dma/Makefile
 create mode 100644 tools/testing/selftests/dma/config
 create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 608fc8484c02..a1e38d5e14f6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5247,6 +5247,12 @@ F:   include/linux/dma-mapping.h
 F: include/linux/dma-map-ops.h
 F: kernel/dma/
 
+DMA MAPPING BENCHMARK
+M: Barry Song 
+L: iommu@lists.linux-foundation.org
+F: kernel/dma/map_benchmark.c
+F: tools/testing/selftests/dma/
+
 DMA-BUF HEAPS FRAMEWORK
 M: Sumit Semwal 
 R: Benjamin Gaignard 
diff --git a/tools/testing/selftests/dma/Makefile b/tools/testing/selftests/dma/Makefile
new file mode 100644
index ..aa8e8b5b3864
--- /dev/null
+++ b/tools/testing/selftests/dma/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -I../../../../usr/include/
+
+TEST_GEN_PROGS := dma_map_benchmark
+
+include ../lib.mk
diff --git a/tools/testing/selftests/dma/config b/tools/testing/selftests/dma/config
new file mode 100644
index ..6102ee3c43cd
--- /dev/null
+++ b/tools/testing/selftests/dma/config
@@ -0,0 +1 @@
+CONFIG_DMA_MAP_BENCHMARK=y
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
new file mode 100644
index ..4778df0c458f
--- /dev/null
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Hisilicon Limited.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
+#define DMA_MAP_MAX_THREADS 1024
+#define DMA_MAP_MAX_SECONDS 300
+
+struct map_benchmark {
+   __u64 avg_map_100ns; /* average map latency in 100ns */
+   __u64 map_stddev; /* standard deviation of map latency */
+   __u64 avg_unmap_100ns; /* as above */
+   __u64 unmap_stddev;
+   __u32 threads; /* how many threads will do map/unmap in parallel */
+   __u32 seconds; /* how long the test will last */
+   int node; /* which numa node this benchmark will run on */
+   __u64 expansion[10]; /* For future use */
+};
+
+int main(int argc, char **argv)
+{
+   struct map_benchmark map;
+   int fd, opt;
+   /* default single thread, run 20 seconds on NUMA_NO_NODE */
+   int threads = 1, seconds = 20, node = -1;
+   int cmd = DMA_MAP_BENCHMARK;
+   char *p;
+
+   while ((opt = getopt(argc, argv, "t:s:n:")) != -1) {
+   switch (opt) {
+   case 't':
+   threads = atoi(optarg);
+   break;
+   case 's':
+   seconds = atoi(optarg);
+   break;
+   case 'n':
+   node = atoi(optarg);
+   break;
+   default:
+   return -1;
+   }
+   }
+
+   if (threads <= 0 || threads > DMA_MAP_MAX_THREADS) {
+   fprintf(stderr, "invalid number of threads, must be in 1-%d\n",
+   DMA_MAP_MAX_THREADS);
+   exit(1);
+   }
+
+   if (seconds <= 0 || seconds > DMA_MAP_MAX_SECONDS) {
+   fprintf(stderr, "invalid number of seconds, must be in 1-%d\n",
+   DMA_MAP_MAX_SECONDS);
+   exit(1);
+   }
+
+   fd = open("/sys/kernel/debug/

[PATCH v2 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-01 Thread Barry Song
Nowadays, there are increasing requirements to benchmark the performance
of dma_map and dma_unmap particularly while the device is attached to an
IOMMU.

This patch enables the support. Users can run specified number of threads
to do dma_map_page and dma_unmap_page on a specific NUMA node with the
specified duration. Then dma_map_benchmark will calculate the average
latency for map and unmap.

A difficulty for this benchmark is that dma_map/unmap APIs must run on
a particular device. Each device might have different backend of IOMMU or
non-IOMMU.

So we use the driver_override to bind dma_map_benchmark to a particular
device by:
For platform devices:
echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind

For PCI devices:
echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override
echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind
echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
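
The rebinding steps above all come down to writing a string into a sysfs
file. As a rough illustration only, a userspace helper could build those
paths as below. The function names are hypothetical and not part of this
patch, and "xxx" is the same placeholder used above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: builds
 * /sys/bus/<bus>/devices/<dev>/driver_override into out.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int override_path(char *out, size_t len, const char *bus,
			 const char *dev)
{
	int n;

	assert(out && bus && dev);
	n = snprintf(out, len, "/sys/bus/%s/devices/%s/driver_override",
		     bus, dev);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: builds /sys/bus/<bus>/drivers/<drv>/<op>,
 * where op is "bind" or "unbind".
 */
static int driver_ctl_path(char *out, size_t len, const char *bus,
			   const char *drv, const char *op)
{
	int n;

	assert(out && bus && drv && op);
	assert(strcmp(op, "bind") == 0 || strcmp(op, "unbind") == 0);
	n = snprintf(out, len, "/sys/bus/%s/drivers/%s/%s", bus, drv, op);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would write "dma_map_benchmark" to the first path, then the device
name to the unbind path of the old driver and to the bind path of
dma_map_benchmark, in that order.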

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Shuah Khan 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Signed-off-by: Barry Song 
---
-v2:
  * add PCI support; v1 supported platform devices only
  * replace ssleep by msleep_interruptible() to permit users to exit
benchmark before it is completed
  * many changes according to Robin's suggestions, thanks! Robin
- add standard deviation output to reflect the worst case
- check users' parameters strictly like the number of threads
- make cache dirty before dma_map
- fix unpaired dma_map_page and dma_unmap_single;
- remove redundant "long long" before ktime_to_ns();
- use devm_add_action();
- wakeup all threads together after they are ready

 kernel/dma/Kconfig |   8 +
 kernel/dma/Makefile|   1 +
 kernel/dma/map_benchmark.c | 295 +
 3 files changed, 304 insertions(+)
 create mode 100644 kernel/dma/map_benchmark.c

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index c99de4a21458..949c53da5991 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
  is technically out-of-spec.
 
  If unsure, say N.
+
+config DMA_MAP_BENCHMARK
+   bool "Enable benchmarking of streaming DMA mapping"
+   help
+ Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
+ performance of dma_(un)map_page.
+
+ See tools/testing/selftests/dma/dma_map_benchmark.c
diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
index dc755ab68aab..7aa6b26b1348 100644
--- a/kernel/dma/Makefile
+++ b/kernel/dma/Makefile
@@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG)   += debug.o
 obj-$(CONFIG_SWIOTLB)  += swiotlb.o
 obj-$(CONFIG_DMA_COHERENT_POOL) += pool.o
 obj-$(CONFIG_DMA_REMAP) += remap.o
+obj-$(CONFIG_DMA_MAP_BENCHMARK) += map_benchmark.o
diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
new file mode 100644
index 000000000000..ac397758087b
--- /dev/null
+++ b/kernel/dma/map_benchmark.c
@@ -0,0 +1,295 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Hisilicon Limited.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
+#define DMA_MAP_MAX_THREADS 1024
+#define DMA_MAP_MAX_SECONDS 300
+
+struct map_benchmark {
+   __u64 avg_map_100ns; /* average map latency in 100ns */
+   __u64 map_stddev; /* standard deviation of map latency */
+   __u64 avg_unmap_100ns; /* as above */
+   __u64 unmap_stddev;
+   __u32 threads; /* how many threads will do map/unmap in parallel */
+   __u32 seconds; /* how long the test will last */
+   int node; /* which numa node this benchmark will run on */
+   __u64 expansion[10];/* For future use */
+};
+
+struct map_benchmark_data {
+   struct map_benchmark bparam;
+   struct device *dev;
+   struct dentry  *debugfs;
+   atomic64_t sum_map_100ns;
+   atomic64_t sum_unmap_100ns;
+   atomic64_t sum_square_map;
+   atomic64_t sum_square_unmap;
+   atomic64_t loops;
+};
+
+static int map_benchmark_thread(void *data)
+{
+   void *buf;
+   dma_addr_t dma_addr;
+   struct map_benchmark_data *map = data;
+   int ret = 0;
+
+   buf = (void *)__get_free_page(GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   while (!kthread_should_stop())  {
+   __u64 map_100ns, unmap_100ns, map_square, unmap_square;
+   ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
+
+   /*
+* for a non-coherent device, if we don't stain them 

[PATCH v2 0/2] dma-mapping: provide a benchmark for streaming DMA mapping

2020-11-01 Thread Barry Song
Nowadays, there are increasing requirements to benchmark the performance
of dma_map and dma_unmap, particularly while the device is attached to an
IOMMU.

This patchset provides the benchmark infrastructure for streaming DMA
mapping. The architecture of the code is very similar to the GUP
benchmark:
* mm/gup_benchmark.c provides kernel interface;
* tools/testing/selftests/vm/gup_benchmark.c provides user program to
call the interface provided by mm/gup_benchmark.c.

In our case, kernel/dma/map_benchmark.c is like mm/gup_benchmark.c, and
tools/testing/selftests/dma/dma_map_benchmark.c is like
tools/testing/selftests/vm/gup_benchmark.c.

A major difference from the GUP benchmark is that the DMA_MAP benchmark needs
to run on a device. Consider one board with the devices and IOMMUs below:
device A  --- IOMMU 1
device B  --- IOMMU 2
device C  --- non-IOMMU

Different devices might attach to different IOMMUs or to no IOMMU at all. To
make the benchmark run, we can either
* create a virtual device and hack the kernel code to attach the virtual
device to IOMMU1, IOMMU2 or non-IOMMU.
* use the existing driver_override mechanism, unbind device A, B, or C from
its original driver and bind it to the dma_map_benchmark platform driver or
pci driver for benchmarking.

In this patchset, I prefer to use driver_override and avoid the ugly
hack in the kernel. We can dynamically switch between devices behind different
IOMMUs to get the performance of IOMMU or non-IOMMU.

-v2:
  * add PCI support; v1 supported platform devices only
  * replace ssleep by msleep_interruptible() to permit users to exit
benchmark before it is completed
  * many changes according to Robin's suggestions, thanks! Robin
- add standard deviation output to reflect the worst case
- check users' parameters strictly like the number of threads
- make cache dirty before dma_map
- fix unpaired dma_map_page and dma_unmap_single;
- remove redundant "long long" before ktime_to_ns();
- use devm_add_action();
- wakeup all threads together after they are ready

Barry Song (2):
  dma-mapping: add benchmark support for streaming DMA APIs
  selftests/dma: add test application for DMA_MAP_BENCHMARK

 MAINTAINERS   |   6 +
 kernel/dma/Kconfig|   8 +
 kernel/dma/Makefile   |   1 +
 kernel/dma/map_benchmark.c| 295 ++
 tools/testing/selftests/dma/Makefile  |   6 +
 tools/testing/selftests/dma/config|   1 +
 .../testing/selftests/dma/dma_map_benchmark.c |  87 ++
 7 files changed, 404 insertions(+)
 create mode 100644 kernel/dma/map_benchmark.c
 create mode 100644 tools/testing/selftests/dma/Makefile
 create mode 100644 tools/testing/selftests/dma/config
 create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c

-- 
2.25.1



[PATCH v2 2/2] selftests/dma: add test application for DMA_MAP_BENCHMARK

2020-11-01 Thread Barry Song
This patch provides the test application for DMA_MAP_BENCHMARK.

Before running the test application, we need to bind a device to the
dma_map_benchmark driver. For example, unbind "xxx" from its original driver
and bind it to dma_map_benchmark:

echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind

Another example for PCI devices:
echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override
echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind
echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind

The below command will run 16 threads on numa node 0 for 10 seconds on
the device bound to dma_map_benchmark platform_driver or pci_driver:
./dma_map_benchmark -t 16 -s 10 -n 0
dma mapping benchmark: threads:16 seconds:10
average map latency(us):1.1 standard deviation:1.9
average unmap latency(us):0.5 standard deviation:0.8
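
The average and standard deviation printed above can be derived from three
running totals. The sketch below is only an illustration under stated
assumptions: it takes a sum of latencies, a sum of squared latencies and a
loop count, all in 100ns units, and uses plain integer math. The helper
names are hypothetical and this is not claimed to be the driver's code.

```c
#include <assert.h>
#include <stdint.h>

/* Integer square root (floor), bit-by-bit method; avoids needing libm. */
static uint64_t isqrt64(uint64_t x)
{
	uint64_t result = 0;
	uint64_t bit = (uint64_t)1 << 62;

	while (bit > x)
		bit >>= 2;
	while (bit != 0) {
		if (x >= result + bit) {
			x -= result + bit;
			result = (result >> 1) + bit;
		} else {
			result >>= 1;
		}
		bit >>= 2;
	}
	return result;
}

/* Hypothetical result type: both values are in 100ns units. */
struct latency_summary {
	uint64_t mean_100ns;
	uint64_t stddev_100ns;
};

/*
 * mean = sum / loops
 * variance = sum_square / loops - mean * mean
 * stddev = sqrt(variance)
 */
static struct latency_summary summarize_latency(uint64_t sum_100ns,
						uint64_t sum_square,
						uint64_t loops)
{
	struct latency_summary s = { 0, 0 };
	uint64_t mean_square;

	if (loops == 0)
		return s;

	s.mean_100ns = sum_100ns / loops;
	mean_square = sum_square / loops;
	/* Guard against underflow caused by integer rounding. */
	if (mean_square > s.mean_100ns * s.mean_100ns)
		s.stddev_100ns = isqrt64(mean_square -
					 s.mean_100ns * s.mean_100ns);
	return s;
}
```

Dividing either field by 10.0 gives microseconds, which is the unit shown in
the sample output.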

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Shuah Khan 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Signed-off-by: Barry Song 
---
 -v2:
 * check parameters like threads, seconds strictly
 * print standard deviation for latencies
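
As a side note on strict parameter checking: atoi() accepts trailing garbage
such as "12abc" and cannot report errors. A stricter parser could look
roughly like the sketch below. The helper name is hypothetical; only the
1..max range idea is taken from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal integer in the range 1..max.
 * Returns 0 and stores the value in *out on success, -1 otherwise.
 */
static int parse_bounded(const char *s, int max, int *out)
{
	char *end;
	long v;

	assert(s && out);
	errno = 0;
	v = strtol(s, &end, 10);
	/* Reject overflow, empty input and trailing characters. */
	if (errno || end == s || *end != '\0')
		return -1;
	if (v < 1 || v > max)
		return -1;
	*out = (int)v;
	return 0;
}
```

With this, "-t 16" would be accepted against a limit of 1024, while "-t 0",
"-t 2000" and "-t 12abc" would all be rejected.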

 MAINTAINERS   |  6 ++
 tools/testing/selftests/dma/Makefile  |  6 ++
 tools/testing/selftests/dma/config|  1 +
 .../testing/selftests/dma/dma_map_benchmark.c | 87 +++
 4 files changed, 100 insertions(+)
 create mode 100644 tools/testing/selftests/dma/Makefile
 create mode 100644 tools/testing/selftests/dma/config
 create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 608fc8484c02..a1e38d5e14f6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5247,6 +5247,12 @@ F:   include/linux/dma-mapping.h
 F: include/linux/dma-map-ops.h
 F: kernel/dma/
 
+DMA MAPPING BENCHMARK
+M: Barry Song 
+L: iommu@lists.linux-foundation.org
+F: kernel/dma/map_benchmark.c
+F: tools/testing/selftests/dma/
+
 DMA-BUF HEAPS FRAMEWORK
 M: Sumit Semwal 
 R: Benjamin Gaignard 
diff --git a/tools/testing/selftests/dma/Makefile b/tools/testing/selftests/dma/Makefile
new file mode 100644
index 000000000000..aa8e8b5b3864
--- /dev/null
+++ b/tools/testing/selftests/dma/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -I../../../../usr/include/
+
+TEST_GEN_PROGS := dma_map_benchmark
+
+include ../lib.mk
diff --git a/tools/testing/selftests/dma/config b/tools/testing/selftests/dma/config
new file mode 100644
index 000000000000..6102ee3c43cd
--- /dev/null
+++ b/tools/testing/selftests/dma/config
@@ -0,0 +1 @@
+CONFIG_DMA_MAP_BENCHMARK=y
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
new file mode 100644
index 000000000000..4778df0c458f
--- /dev/null
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Hisilicon Limited.
+ */
+
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <linux/types.h>
+
+#define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
+#define DMA_MAP_MAX_THREADS 1024
+#define DMA_MAP_MAX_SECONDS 300
+
+struct map_benchmark {
+   __u64 avg_map_100ns; /* average map latency in 100ns */
+   __u64 map_stddev; /* standard deviation of map latency */
+   __u64 avg_unmap_100ns; /* as above */
+   __u64 unmap_stddev;
+   __u32 threads; /* how many threads will do map/unmap in parallel */
+   __u32 seconds; /* how long the test will last */
+   int node; /* which numa node this benchmark will run on */
+   __u64 expansion[10];/* For future use */
+};
+
+int main(int argc, char **argv)
+{
+   struct map_benchmark map;
+   int fd, opt;
+   /* default single thread, run 20 seconds on NUMA_NO_NODE */
+   int threads = 1, seconds = 20, node = -1;
+   int cmd = DMA_MAP_BENCHMARK;
+   char *p;
+
+   while ((opt = getopt(argc, argv, "t:s:n:")) != -1) {
+   switch (opt) {
+   case 't':
+   threads = atoi(optarg);
+   break;
+   case 's':
+   seconds = atoi(optarg);
+   break;
+   case 'n':
+   node = atoi(optarg);
+   break;
+   default:
+   return -1;
+   }
+   }
+
+   if (threads <= 0 || threads > DMA_MAP_MAX_THREADS) {
+   fprintf(stderr, "invalid number of threads, must be in 1-%d\n",
+   DMA_MAP_MAX_THREADS);
+   exit(1);
+   }
+
+   if (seconds <= 0 || seconds > DMA_MAP_MAX_SECONDS) {
+   fprintf(stderr, "invalid number of seconds, must be in 1-%d\n",
+  

RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-10-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song) [mailto:song.bao@hisilicon.com]
> Sent: Saturday, October 31, 2020 10:45 PM
> To: Robin Murphy ;
> iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com
> Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm
> ; linux-kselft...@vger.kernel.org
> Subject: RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
> 
> 
> 
> > -Original Message-
> > From: Robin Murphy [mailto:robin.mur...@arm.com]
> > Sent: Saturday, October 31, 2020 4:48 AM
> > To: Song Bao Hua (Barry Song) ;
> > iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com
> > Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm
> > ; linux-kselft...@vger.kernel.org
> > Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> > DMA APIs
> >
> > On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote:
> > [...]
> > >>> +struct map_benchmark {
> > >>> +   __u64 map_nsec;
> > >>> +   __u64 unmap_nsec;
> > >>> +   __u32 threads; /* how many threads will do map/unmap in parallel */
> > >>> +   __u32 seconds; /* how long the test will last */
> > >>> +   int node; /* which numa node this benchmark will run on */
> > >>> +   __u64 expansion[10];/* For future use */
> > >>> +};
> > >>
> > >> I'm no expert on userspace ABIs (and what little experience I do have
> > >> is mostly of Win32...), so hopefully someone else will comment if
> > >> there's anything of concern here. One thing I wonder is that there's
> > >> a fair likelihood of functionality evolving here over time, so might
> > >> it be appropriate to have some sort of explicit versioning parameter
> > >> for robustness?
> > >
> > > I copied that from gup_benchmark. There is no code of this kind there to
> > > compare versions.
> > > I believe there is a likelihood that the kernel module is changed but
> > > users are still using an old userspace tool; this might lead to an
> > > incompatible data structure.
> > > But I am not sure if it is a big problem :-)
> >
> > Yeah, like I say I don't really have a good feeling for what would be best
> > here, I'm just thinking of what I do know and wary of the potential for a
> > "640 bits ought to be enough for anyone" issue ;)
> >
> > >>> +struct map_benchmark_data {
> > >>> +   struct map_benchmark bparam;
> > >>> +   struct device *dev;
> > >>> +   struct dentry  *debugfs;
> > >>> +   atomic64_t total_map_nsecs;
> > >>> +   atomic64_t total_map_loops;
> > >>> +   atomic64_t total_unmap_nsecs;
> > >>> +   atomic64_t total_unmap_loops;
> > >>> +};
> > >>> +
> > >>> +static int map_benchmark_thread(void *data) {
> > >>> +   struct page *page;
> > >>> +   dma_addr_t dma_addr;
> > >>> +   struct map_benchmark_data *map = data;
> > >>> +   int ret = 0;
> > >>> +
> > >>> +   page = alloc_page(GFP_KERNEL);
> > >>> +   if (!page)
> > >>> +   return -ENOMEM;
> > >>> +
> > >>> +   while (!kthread_should_stop())  {
> > >>> +   ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
> > >>> +
> > >>> +   map_stime = ktime_get();
> > >>> +   dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
> > >>
> > >> Note that for a non-coherent device, this will give an underestimate
> > >> of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings,
> > >> since the page will never be dirty in the cache (except possibly the
> > >> very first time through).
> > >
> > > Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1"
> > > after we have this basic framework.
> >
> > That wasn't so much about the direction itself, just that if it's anything
> > other than FROM_DEVICE, we should probably do something to dirty the buffer
> > by a reasonable amount before each map. Otherwise the measured performance
> > is going to be unrealistic on many systems.
> 
> Maybe pu

RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-10-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Saturday, October 31, 2020 4:48 AM
> To: Song Bao Hua (Barry Song) ;
> iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com
> Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm
> ; linux-kselft...@vger.kernel.org
> Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
> 
> On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote:
> [...]
> >>> +struct map_benchmark {
> >>> + __u64 map_nsec;
> >>> + __u64 unmap_nsec;
> >>> + __u32 threads; /* how many threads will do map/unmap in parallel */
> >>> + __u32 seconds; /* how long the test will last */
> >>> + int node; /* which numa node this benchmark will run on */
> >>> + __u64 expansion[10];/* For future use */
> >>> +};
> >>
> >> I'm no expert on userspace ABIs (and what little experience I do have
> >> is mostly of Win32...), so hopefully someone else will comment if
> >> there's anything of concern here. One thing I wonder is that there's
> >> a fair likelihood of functionality evolving here over time, so might
> >> it be appropriate to have some sort of explicit versioning parameter
> >> for robustness?
> >
> > I copied that from gup_benchmark. There is no code of this kind there to
> > compare versions.
> > I believe there is a likelihood that the kernel module is changed but
> > users are still using an old userspace tool; this might lead to an
> > incompatible data structure.
> > But I am not sure if it is a big problem :-)
> 
> Yeah, like I say I don't really have a good feeling for what would be best
> here, I'm just thinking of what I do know and wary of the potential for a
> "640 bits ought to be enough for anyone" issue ;)
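
For illustration only, one possible shape of the explicit versioning idea
raised above is sketched here. Nothing in it comes from the patch: the
struct layout, the version constant and the function name are all
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BENCH_ABI_VERSION 1u	/* hypothetical version number */

/* Hypothetical request layout with the version as the first field. */
struct bench_request {
	uint32_t version;	/* must equal BENCH_ABI_VERSION */
	uint32_t threads;
	uint32_t seconds;
	int32_t node;
	uint64_t expansion[10];	/* reserved for future use */
};

/*
 * Returns 0 when the caller and the receiver agree on the layout,
 * -1 otherwise, so that a stale tool fails loudly instead of
 * exchanging a mismatched structure.
 */
static int bench_check_version(const struct bench_request *req)
{
	assert(req);
	return req->version == BENCH_ABI_VERSION ? 0 : -1;
}
```

The receiving side would run the check before touching any other field and
reject the request on mismatch.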
> 
> >>> +struct map_benchmark_data {
> >>> + struct map_benchmark bparam;
> >>> + struct device *dev;
> >>> + struct dentry  *debugfs;
> >>> + atomic64_t total_map_nsecs;
> >>> + atomic64_t total_map_loops;
> >>> + atomic64_t total_unmap_nsecs;
> >>> + atomic64_t total_unmap_loops;
> >>> +};
> >>> +
> >>> +static int map_benchmark_thread(void *data) {
> >>> + struct page *page;
> >>> + dma_addr_t dma_addr;
> >>> + struct map_benchmark_data *map = data;
> >>> + int ret = 0;
> >>> +
> >>> + page = alloc_page(GFP_KERNEL);
> >>> + if (!page)
> >>> + return -ENOMEM;
> >>> +
> >>> + while (!kthread_should_stop())  {
> >>> + ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
> >>> +
> >>> + map_stime = ktime_get();
> >>> + dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
> >>
> >> Note that for a non-coherent device, this will give an underestimate
> >> of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings,
> >> since the page will never be dirty in the cache (except possibly the
> >> very first time through).
> >
> > Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1"
> > after we have this basic framework.
> 
> That wasn't so much about the direction itself, just that if it's anything
> other than FROM_DEVICE, we should probably do something to dirty the buffer
> by a reasonable amount before each map. Otherwise the measured performance
> is going to be unrealistic on many systems.

Maybe putting a memset(buf, 0, PAGE_SIZE) before dma_map would help?
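
To make the ordering concrete, here is a small userspace sketch of "dirty
first, then map". It is only an analogy under stated assumptions: the
callback stands in for dma_map, the 0x66 fill pattern is an arbitrary
choice, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in type for the mapping step that runs on the dirtied buffer. */
typedef size_t (*map_fn)(const unsigned char *buf, size_t len);

/*
 * Write to every byte of the buffer before handing it to the mapping
 * step, so the mapping step always sees freshly written data.
 */
static size_t dirty_then_map(unsigned char *buf, size_t len, map_fn map)
{
	assert(buf && map);
	memset(buf, 0x66, len);	/* arbitrary nonzero pattern */
	return map(buf, len);
}

/* Example stand-in: counts how many bytes carry the fill pattern. */
static size_t count_dirty_bytes(const unsigned char *buf, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++)
		if (buf[i] == 0x66)
			n++;
	return n;
}
```

In a benchmark loop the write would sit outside the timed region, so that
only the map cost is measured.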

> 
> [...]
> >>> + atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime, unmap_stime)),
> >>> + &map->total_unmap_nsecs);
> >>> + atomic64_inc(&map->total_map_loops);
> >>> + atomic64_inc(&map->total_unmap_loops);
> >>
> >> I think it would be worth keeping track of the variances as well - it
> >> can be hard to tell if a reasonable-looking average is hiding
> >> terrible worst-case behaviour.
> >
> > This is a sensible requirement. I believe it is better to be handled
> > by the existing kernel tracing method.
> >
> > Maybe we need a histogram like:
> > Delay   sample count
> > 1-2us   1000  ***
> > 2-3us   2000  ***
> > 3-4us   100   *
> > .
> > This will be more precise than the maximum latency in the worst case.
> &g
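
A histogram like the one sketched in the quoted text could be accumulated
roughly as follows. This is a minimal sketch under stated assumptions: 1us
wide buckets, a fixed bucket count with the last bucket catching everything
above it, and hypothetical names throughout.

```c
#include <assert.h>
#include <stdint.h>

#define HIST_BUCKETS 8	/* hypothetical: 0-1us ... 6-7us, then 7us and up */

struct latency_hist {
	uint64_t count[HIST_BUCKETS];
};

/* Map a latency in nanoseconds to a 1us wide bucket index. */
static unsigned int hist_bucket(uint64_t latency_ns)
{
	uint64_t us = latency_ns / 1000;

	return us >= HIST_BUCKETS ? HIST_BUCKETS - 1 : (unsigned int)us;
}

/* Record one sample. */
static void hist_add(struct latency_hist *h, uint64_t latency_ns)
{
	assert(h);
	h->count[hist_bucket(latency_ns)]++;
}
```

The per-bucket counts can then be printed next to their ranges to show how
heavy the tail is, which an average alone cannot show.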

RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-10-29 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Friday, October 30, 2020 8:38 AM
> To: Song Bao Hua (Barry Song) ;
> iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com
> Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm
> ; linux-kselft...@vger.kernel.org
> Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
> 
> On 2020-10-27 03:53, Barry Song wrote:
> > Nowadays, there are increasing requirements to benchmark the performance
> > of dma_map and dma_unmap, particularly while the device is attached to an
> > IOMMU.
> >
> > This patch enables the support. Users can run a specified number of threads
> > to do dma_map_page and dma_unmap_page on a specific NUMA node for the
> > specified duration. Then dma_map_benchmark will calculate the average
> > latency for map and unmap.
> >
> > A difficulty for this benchmark is that dma_map/unmap APIs must run on
> > a particular device. Each device might have a different backend, IOMMU or
> > non-IOMMU.
> >
> > So we use driver_override to bind dma_map_benchmark to a particular
> > device by:
> > echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
> > echo xxx > /sys/bus/platform/drivers/xxx/unbind
> > echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> >
> > For this moment, it supports platform device only, PCI device will also
> > be supported afterwards.
> 
> Neat! This is something I've thought about many times, but never got
> round to attempting :)

I am happy you have the same needs. When I came to the IOMMU area half a year
ago, the first thing I did was write a rough benchmark. At that time, I hacked
the kernel to get a device behind an IOMMU.

Recently, I got some time to think about how to get a "device" without ugly
hacking, and then cleaned up the code to send out patches providing a common
benchmark that everybody can use.

> 
> I think the basic latency measurement for mapping and unmapping pages is
> enough to start with, but there are definitely some more things that
> would be interesting to look into for future enhancements:
> 
>   - a choice of mapping sizes, both smaller and larger than one page, to
> help characterise stuff like cache maintenance overhead and bounce
> buffer/IOVA fragmentation.
>   - alternative allocation patterns like doing lots of maps first, then
> all their corresponding unmaps (to provoke things like the worst-case
> IOVA rcache behaviour).
>   - ways to exercise a range of those parameters at once across
> different threads in a single test.
> 

Yes, sure. Once we have a basic framework, we can add more benchmark patterns
by using different parameters in the userspace tool:
tools/testing/selftests/dma/dma_map_benchmark.c

Similar extensions have been carried out in GUP_BENCHMARK.

> But let's get a basic framework nailed down first...

Sure.

> 
> > Cc: Joerg Roedel 
> > Cc: Will Deacon 
> > Cc: Shuah Khan 
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Robin Murphy 
> > Signed-off-by: Barry Song 
> > ---
> >   kernel/dma/Kconfig |   8 ++
> >   kernel/dma/Makefile|   1 +
> >   kernel/dma/map_benchmark.c | 202 +
> >   3 files changed, 211 insertions(+)
> >   create mode 100644 kernel/dma/map_benchmark.c
> >
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index c99de4a21458..949c53da5991 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
> >   is technically out-of-spec.
> >
> >   If unsure, say N.
> > +
> > +config DMA_MAP_BENCHMARK
> > +   bool "Enable benchmarking of streaming DMA mapping"
> > +   help
> > + Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
> > + performance of dma_(un)map_page.
> > +
> > + See tools/testing/selftests/dma/dma_map_benchmark.c
> > diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
> > index dc755ab68aab..7aa6b26b1348 100644
> > --- a/kernel/dma/Makefile
> > +++ b/kernel/dma/Makefile
> > @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG)   += debug.o
> >   obj-$(CONFIG_SWIOTLB) += swiotlb.o
> >   obj-$(CONFIG_DMA_COHERENT_POOL)   += pool.o
> >   obj-$(CONFIG_DMA_REMAP)   += remap.o
> > +obj-$(CONFIG_DMA_MAP_BENCHMARK) += map_benchmark.o
> > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchma

RE: [PATCH] dma: Per-NUMA-node CMA should depend on NUMA

2020-10-27 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: h...@lst.de [mailto:h...@lst.de]
> Sent: Tuesday, October 27, 2020 8:55 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Robin Murphy ; h...@lst.de;
> iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] dma: Per-NUMA-node CMA should depend on NUMA
> 
> On Mon, Oct 26, 2020 at 08:07:43PM +0000, Song Bao Hua (Barry Song) wrote:
> > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > > index c99de4a21458..964b74c9b7e3 100644
> > > --- a/kernel/dma/Kconfig
> > > +++ b/kernel/dma/Kconfig
> > > @@ -125,7 +125,8 @@ if  DMA_CMA
> > >
> > >  config DMA_PERNUMA_CMA
> > >   bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
> > > - default NUMA && ARM64
> > > + depends on NUMA
> > > + default ARM64
> >
> > On the other hand, at this moment, only ARM64 is calling the init code
> > to get per_numa cma. Do we need it to depend on NUMA && ARM64,
> > so that this is not enabled on non-arm64?
> 
> I actually hate having arch symbols in common code.  A new
> ARCH_HAS_DMA_PERNUMA_CMA, only selected by arm64 for now, would be
> cleaner I think.

Sounds good to me.

BTW,  +Will.

Last time we talked about the default pernuma CMA size, you suggested a bootargs
in arch/arm64/Kconfig, but Will seems to have a different idea. Am I right, Will?

Should we let aarch64 call dma_pernuma_cma_reserve(16MB) rather than
dma_pernuma_cma_reserve()?

In this way, users will at least get a default pernuma CMA, which is required
at least by the IOMMU. If users set a "cma_pernuma" bootarg, it would override
the default size from the aarch64 code.

I mean

- void __init dma_pernuma_cma_reserve(void)
+ void __init dma_pernuma_cma_reserve(size_t size)
{
if (!pernuma_size_bytes)
+   pernuma_size_bytes = size;

}

Right now, it is easy for users to forget to set cma_pernuma in bootargs,
so this feature probably ends up not being enabled.
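
The intended precedence (a boot argument wins, otherwise the arch default
applies) can be modelled in plain C as below. This is a userspace sketch
only; the function name is hypothetical and the variable merely mirrors the
pernuma_size_bytes name used above.

```c
#include <assert.h>
#include <stddef.h>

/* 0 means that no "cma_pernuma" boot argument was given. */
static size_t pernuma_size_bytes;

/*
 * Hypothetical model of the proposal: keep a size that was already set
 * by the boot argument, otherwise fall back to the arch default.
 * Returns the size that would be reserved per NUMA node.
 */
static size_t pernuma_cma_reserve_size(size_t arch_default)
{
	assert(arch_default > 0);
	if (!pernuma_size_bytes)
		pernuma_size_bytes = arch_default;
	return pernuma_size_bytes;
}
```

With no boot argument a 16MB arch default is used; if the boot argument set
32MB earlier, the 16MB default is ignored.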

Thanks
Barry


[PATCH 0/2] dma-mapping: provide a benchmark for streaming DMA mapping

2020-10-26 Thread Barry Song
Nowadays, there are increasing requirements to benchmark the performance
of dma_map and dma_unmap, particularly while the device is attached to an
IOMMU.

This patchset provides the benchmark infrastructure for streaming DMA
mapping. The architecture of the code is very similar to the GUP
benchmark:
* mm/gup_benchmark.c provides kernel interface;
* tools/testing/selftests/vm/gup_benchmark.c provides user program to
call the interface provided by mm/gup_benchmark.c.

In our case, kernel/dma/map_benchmark.c is like mm/gup_benchmark.c, and
tools/testing/selftests/dma/dma_map_benchmark.c is like
tools/testing/selftests/vm/gup_benchmark.c.

A major difference from the GUP benchmark is that the DMA_MAP benchmark needs
to run on a device. Consider one board with the devices and IOMMUs below:
device A  --- IOMMU 1
device B  --- IOMMU 2
device C  --- non-IOMMU

Different devices might attach to different IOMMUs or to no IOMMU at all. To
make the benchmark run, we can either
* create a virtual device and hack the kernel code to attach the virtual
device to IOMMU1, IOMMU2 or non-IOMMU.
* use the existing driver_override mechanism, unbind device A, B, or C from
their original driver and bind them to "dma_map_benchmark" platform_driver
or pci_driver for benchmarking.

In this patchset, I prefer to use driver_override and avoid the various
hacks in the kernel. We can dynamically switch between devices behind different
IOMMUs to get the performance of dma map on IOMMU or non-IOMMU.

Barry Song (2):
  dma-mapping: add benchmark support for streaming DMA APIs
  selftests/dma: add test application for DMA_MAP_BENCHMARK

 MAINTAINERS   |   6 +
 kernel/dma/Kconfig|   8 +
 kernel/dma/Makefile   |   1 +
 kernel/dma/map_benchmark.c| 202 ++
 tools/testing/selftests/dma/Makefile  |   6 +
 tools/testing/selftests/dma/config|   1 +
 .../testing/selftests/dma/dma_map_benchmark.c |  72 +++
 7 files changed, 296 insertions(+)
 create mode 100644 kernel/dma/map_benchmark.c
 create mode 100644 tools/testing/selftests/dma/Makefile
 create mode 100644 tools/testing/selftests/dma/config
 create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c

-- 
2.25.1



[PATCH 2/2] selftests/dma: add test application for DMA_MAP_BENCHMARK

2020-10-26 Thread Barry Song
This patch provides the test application for DMA_MAP_BENCHMARK.

Before running the test application, we need to bind a device to the
dma_map_benchmark driver. For example, unbind "xxx" from its original driver
and bind it to dma_map_benchmark:

echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind

Then, run 10 threads on numa node 1 for 10 seconds on device "xxx":
./dma_map_benchmark -t 10 -s 10 -n 1
dma mapping benchmark: average map_nsec:3619 average unmap_nsec:2423

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Shuah Khan 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Signed-off-by: Barry Song 
---
 MAINTAINERS   |  6 ++
 tools/testing/selftests/dma/Makefile  |  6 ++
 tools/testing/selftests/dma/config|  1 +
 .../testing/selftests/dma/dma_map_benchmark.c | 72 +++
 4 files changed, 85 insertions(+)
 create mode 100644 tools/testing/selftests/dma/Makefile
 create mode 100644 tools/testing/selftests/dma/config
 create mode 100644 tools/testing/selftests/dma/dma_map_benchmark.c

diff --git a/MAINTAINERS b/MAINTAINERS
index f310f0a09904..552389874ca2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5220,6 +5220,12 @@ F:   include/linux/dma-mapping.h
 F: include/linux/dma-map-ops.h
 F: kernel/dma/
 
+DMA MAPPING BENCHMARK
+M: Barry Song 
+L: iommu@lists.linux-foundation.org
+F: kernel/dma/map_benchmark.c
+F: tools/testing/selftests/dma/
+
 DMA-BUF HEAPS FRAMEWORK
 M: Sumit Semwal 
 R: Andrew F. Davis 
diff --git a/tools/testing/selftests/dma/Makefile b/tools/testing/selftests/dma/Makefile
new file mode 100644
index 000000000000..aa8e8b5b3864
--- /dev/null
+++ b/tools/testing/selftests/dma/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -I../../../../usr/include/
+
+TEST_GEN_PROGS := dma_map_benchmark
+
+include ../lib.mk
diff --git a/tools/testing/selftests/dma/config b/tools/testing/selftests/dma/config
new file mode 100644
index 000000000000..6102ee3c43cd
--- /dev/null
+++ b/tools/testing/selftests/dma/config
@@ -0,0 +1 @@
+CONFIG_DMA_MAP_BENCHMARK=y
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
new file mode 100644
index 000000000000..e03bd03e101e
--- /dev/null
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Hisilicon Limited.
+ */
+
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <linux/types.h>
+
+#define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
+
+struct map_benchmark {
+   __u64 map_nsec;
+   __u64 unmap_nsec;
+   __u32 threads; /* how many threads will do map/unmap in parallel */
+   __u32 seconds; /* how long the test will last */
+   int node; /* which numa node this benchmark will run on */
+   __u64 expansion[10];/* For future use */
+};
+
+int main(int argc, char **argv)
+{
+   struct map_benchmark map;
+   int fd, opt, threads = 0, seconds = 0, node = -1;
+   int cmd = DMA_MAP_BENCHMARK;
+   char *p;
+
+   while ((opt = getopt(argc, argv, "t:s:n:")) != -1) {
+   switch (opt) {
+   case 't':
+   threads = atoi(optarg);
+   break;
+   case 's':
+   seconds = atoi(optarg);
+   break;
+   case 'n':
+   node = atoi(optarg);
+   break;
+   default:
+   return -1;
+   }
+   }
+
+   if (threads <= 0 || seconds <= 0) {
+   fprintf(stderr, "invalid number of threads or seconds\n");
+   exit(1);
+   }
+
+   fd = open("/sys/kernel/debug/dma_map_benchmark", O_RDWR);
+   if (fd == -1) {
+   perror("open");
+   exit(1);
+   }
+
+   map.seconds = seconds;
+   map.threads = threads;
+   map.node = node;
+   if (ioctl(fd, cmd, &map)) {
+   perror("ioctl");
+   exit(1);
+   }
+
+   printf("dma mapping benchmark: average map_nsec:%lld average unmap_nsec:%lld\n",
+   map.map_nsec,
+   map.unmap_nsec);
+
+   return 0;
+}
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-10-26 Thread Barry Song
Nowadays, there are increasing requirements to benchmark the performance
of dma_map and dma_unmap, particularly while the device is attached to an
IOMMU.

This patch adds that support. Users can run a specified number of threads
that do dma_map_page and dma_unmap_page on a specific NUMA node for a
specified duration. dma_map_benchmark then calculates the average
latency for map and unmap.

One difficulty for this benchmark is that the dma_map/unmap APIs must run on
a particular device, and each device might sit behind a different IOMMU
backend or no IOMMU at all.

So we use driver_override to bind dma_map_benchmark to a particular
device by:
echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind

For the moment, only platform devices are supported; PCI device support
will be added later.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Shuah Khan 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Signed-off-by: Barry Song 
---
 kernel/dma/Kconfig |   8 ++
 kernel/dma/Makefile|   1 +
 kernel/dma/map_benchmark.c | 202 +
 3 files changed, 211 insertions(+)
 create mode 100644 kernel/dma/map_benchmark.c

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index c99de4a21458..949c53da5991 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
  is technically out-of-spec.
 
  If unsure, say N.
+
+config DMA_MAP_BENCHMARK
+   bool "Enable benchmarking of streaming DMA mapping"
+   help
+ Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
+ performance of dma_(un)map_page.
+
+ See tools/testing/selftests/dma/dma_map_benchmark.c
diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
index dc755ab68aab..7aa6b26b1348 100644
--- a/kernel/dma/Makefile
+++ b/kernel/dma/Makefile
@@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG)   += debug.o
 obj-$(CONFIG_SWIOTLB)  += swiotlb.o
 obj-$(CONFIG_DMA_COHERENT_POOL)+= pool.o
 obj-$(CONFIG_DMA_REMAP)+= remap.o
+obj-$(CONFIG_DMA_MAP_BENCHMARK)+= map_benchmark.o
diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
new file mode 100644
index 000000000000..16a5d7779d67
--- /dev/null
+++ b/kernel/dma/map_benchmark.c
@@ -0,0 +1,202 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Hisilicon Limited.
+ */
+
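+#include <linux/debugfs.h>
+#include <linux/delay.h>
+#include <linux/device.h>
+#include <linux/dma-mapping.h>
+#include <linux/kernel.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+#include <linux/timekeeping.h>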
+
+#define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
+
+struct map_benchmark {
+   __u64 map_nsec;
+   __u64 unmap_nsec;
+   __u32 threads; /* how many threads will do map/unmap in parallel */
+   __u32 seconds; /* how long the test will last */
+   int node; /* which numa node this benchmark will run on */
+   __u64 expansion[10];/* For future use */
+};
+
+struct map_benchmark_data {
+   struct map_benchmark bparam;
+   struct device *dev;
+   struct dentry  *debugfs;
+   atomic64_t total_map_nsecs;
+   atomic64_t total_map_loops;
+   atomic64_t total_unmap_nsecs;
+   atomic64_t total_unmap_loops;
+};
+
+static int map_benchmark_thread(void *data)
+{
+   struct page *page;
+   dma_addr_t dma_addr;
+   struct map_benchmark_data *map = data;
+   int ret = 0;
+
+   page = alloc_page(GFP_KERNEL);
+   if (!page)
+   return -ENOMEM;
+
+   while (!kthread_should_stop())  {
+   ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
+
+   map_stime = ktime_get();
+   dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
+   if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
+   dev_err(map->dev, "dma_map_page failed\n");
+   ret = -ENOMEM;
+   goto out;
+   }
+   map_etime = ktime_get();
+
+   unmap_stime = ktime_get();
+   dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+   unmap_etime = ktime_get();
+
+   atomic64_add((long long)ktime_to_ns(ktime_sub(map_etime, map_stime)),
+   &map->total_map_nsecs);
+   atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime, unmap_stime)),
+   &map->total_unmap_nsecs);
+   atomic64_inc(&map->total_map_loops);
+   atomic64_inc(&map->total_unmap_loops);
+   }
+
+out:
+   __free_page(page);
+   return ret;
+}
+
+static int do_map_benchmark(struct map_benchmark_data *map)
+{
+   struct task_struct **tsk;
+   int threads = map->bparam.threads;
+   int no

RE: [PATCH] dma: Per-NUMA-node CMA should depend on NUMA

2020-10-26 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Tuesday, October 27, 2020 1:25 AM
> To: h...@lst.de
> Cc: iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org; Song Bao
> Hua (Barry Song) 
> Subject: [PATCH] dma: Per-NUMA-node CMA should depend on NUMA
> 
> Offering DMA_PERNUMA_CMA to non-NUMA configs is pointless.
> 

This is right.

> Signed-off-by: Robin Murphy 
> ---
>  kernel/dma/Kconfig | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index c99de4a21458..964b74c9b7e3 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -125,7 +125,8 @@ if  DMA_CMA
> 
>  config DMA_PERNUMA_CMA
>   bool "Enable separate DMA Contiguous Memory Area for each NUMA
> Node"
> - default NUMA && ARM64
> + depends on NUMA
> + default ARM64

On the other hand, at the moment only ARM64 calls the init code that
sets up per-NUMA CMA. Do we need "depends on NUMA && ARM64" so that this
option cannot be enabled on non-arm64 platforms?

>   help
> Enable this option to get pernuma CMA areas so that devices like
> ARM64 SMMU can get local memory by DMA coherent APIs.
> --
 
Thanks
Barry




RE: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency

2020-09-01 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of John Garry
> Sent: Saturday, August 22, 2020 1:54 AM
> To: w...@kernel.org; robin.mur...@arm.com
> Cc: j...@8bytes.org; linux-arm-ker...@lists.infradead.org;
> iommu@lists.linux-foundation.org; m...@kernel.org; Linuxarm
> ; linux-ker...@vger.kernel.org; John Garry
> 
> Subject: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
> 
> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes a lot of time once we start getting many CPUs
> contending - from experiment, for 64 CPUs contending the cmdq, success rate
> is ~ 1 in 12, which is poor, but not totally awful.
> 
> This series removes that cmpxchg() and replaces with an atomic_add, same as
> how the actual cmdq deals with maintaining the prod pointer.
> 
> For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
> increase:
> Before: 1250K IOPs
> After: 1550K IOPs
> 
> I also have a test harness to check the rate of DMA map+unmaps we can
> achieve:
> 
> CPU count   8      16     32     64
> Before:     282K   115K   36K    11K
> After:      302K   193K   80K    30K
> 
> (unit is map+unmaps per CPU per second)

I have seen a performance improvement on the hns3 NIC when sending UDP with
1-32 threads:

Threads number            1         4         8         16        32
Before patch (TX Mbps)    7636.05   16444.36  21694.48  25746.40  25295.93
After patch (TX Mbps)     7711.60   16478.98  26561.06  32628.75  33764.56

As you can see, for 8, 16 and 32 threads, network TX throughput improves a
lot. For 1 and 4 threads, TX throughput is almost the same before and after
the patch. This is sensible as this patch
is mainly for decreasing the lock contention.

> 
> [0]
> https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD2
> 4b8...@dggemi525-mbs.china.huawei.com/T/#ma02e301c38c3e94b7725e
> 685757c27e39c7cbde3
> 
> Differences to v1:
> - Simplify by dropping patch to always issue a CMD_SYNC
> - Use 64b atomic add, keeping prod in a separate 32b field
> 
> John Garry (2):
>   iommu/arm-smmu-v3: Calculate max commands per batch
>   iommu/arm-smmu-v3: Remove cmpxchg() in
> arm_smmu_cmdq_issue_cmdlist()
> 
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 166
> ++--
>  1 file changed, 114 insertions(+), 52 deletions(-)
> 
> --
> 2.26.2

Thanks
Barry



RE: [PATCH v6 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Friday, August 21, 2020 6:19 PM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; iommu@lists.linux-foundation.org; Linuxarm
> ; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; huangdaode 
> Subject: Re: [PATCH v6 0/2] make dma_alloc_coherent NUMA-aware by
> per-NUMA CMA
> 
> FYI, as of the last one I'm fine now, but I really need an ACK from
> the arm64 maintainers.

Hi Christoph,

For the changes in arch/arm64, Will gave his ack here:
https://lore.kernel.org/linux-iommu/20200821090116.GB20255@willie-the-truck/

and the patchset has been refined to v8
https://lore.kernel.org/linux-iommu/20200823230309.28980-1-song.bao@hisilicon.com/
with one additional patch to remove magic number:
[PATCH v8 3/3] mm: cma: use CMA_MAX_NAME to define the length of cma name array
https://lore.kernel.org/linux-iommu/20200823230309.28980-4-song.bao@hisilicon.com/

Hopefully, you didn't miss it:-)
Does the new one need an Ack from Linux-mm maintainer?

Thanks
Barry



RE: [PATCH] iommu/arm-smmu-v3: add tracepoints for cmdq_issue_cmdlist

2020-08-28 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Friday, August 28, 2020 11:18 PM
> To: Song Bao Hua (Barry Song) ; Will Deacon
> 
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> j...@8bytes.org; Linuxarm 
> Subject: Re: [PATCH] iommu/arm-smmu-v3: add tracepoints for
> cmdq_issue_cmdlist
> 
> On 2020-08-28 12:02, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: Will Deacon [mailto:w...@kernel.org]
> >> Sent: Friday, August 28, 2020 10:29 PM
> >> To: Song Bao Hua (Barry Song) 
> >> Cc: iommu@lists.linux-foundation.org;
> linux-arm-ker...@lists.infradead.org;
> >> robin.mur...@arm.com; j...@8bytes.org; Linuxarm
> 
> >> Subject: Re: [PATCH] iommu/arm-smmu-v3: add tracepoints for
> >> cmdq_issue_cmdlist
> >>
> >> On Thu, Aug 27, 2020 at 09:33:51PM +1200, Barry Song wrote:
> >>> cmdq_issue_cmdlist() is the hotspot that uses a lot of time. This patch
> >>> adds tracepoints for it to help debug.
> >>>
> >>> Signed-off-by: Barry Song 
> >>> ---
> >>>   * can furthermore develop an eBPF program to benchmark using this
> trace
> >>
> >> Hmm, don't these things have a history of becoming ABI? If so, I don't
> >> really want them in the driver at all, sorry. Do other drivers overcome
> >> this somehow?
> >
> > Tracepoints of this kind mainly work as low-overhead probe points for
> > debugging. I don't think any application would depend on them. And there
> > are lots of tracepoints in other drivers, even in the iommu core and the
> > intel_iommu driver :-)
> >
> > Developers use them in one of the following ways:
> >
> > 1. get trace print from the ring buffer by reading debugfs
> > root@ubuntu:/sys/kernel/debug/tracing/events/arm_smmu_v3# echo 1 > enable
> > # cat /sys/kernel/debug/tracing/trace_pipe
> > <idle>-0 [058] ..s1 125444.768083: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0 [058] ..s1 125444.768084: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0 [058] ..s1 125444.768085: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0 [058] ..s1 125444.768165: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0 [058] ..s1 125444.768168: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0 [058] ..s1 125444.768169: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0 [058] ..s1 125444.768171: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0 [058] ..s1 125444.768259: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
> > ...
> >
> > This can replace printk with much much lower overhead.
> >
> > 2. add a hook function to the tracepoint to measure latency and collect
> > timing statistics, just like the eBPF example I gave after the commit log.
> >
> > Using it, I can get the histogram of the execution time of
> > cmdq_issue_cmdlist():
> >      nsecs          : count
> >        0 -> 1       : 0
> >        2 -> 3       : 0
> >        4 -> 7       : 0
> >        8 -> 15      : 0
> >       16 -> 31      : 0
> >       32 -> 63      : 0
> >       64 -> 127     : 0
> >      128 -> 255     : 0
> >      256 -> 511     : 0
> >      512 -> 1023    : 58
> >     1024 -> 2047    : 22763
> >     2048 -> 4095    : 13238
> >
> > I feel it is very common to do this kind of thing when analyzing a
> > performance issue. For example, to ease the analysis of softirq latency,
> > softirq.c has the code below:
> >
> > asmlinkage __visible void __softirq_entry __do_softirq(void)
> > {
> > ...
> > trace_softirq_entry(vec_nr);
> > h->action(h);
> > trace_softirq_exit(vec_nr);
> > ...
> > }
> 
> If you only want to measure entry and exit of one specific function,
> though, can't the function graph tracer already do that?

Function graph is able to do this specific thi

RE: [PATCH] iommu/arm-smmu-v3: add tracepoints for cmdq_issue_cmdlist

2020-08-28 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Will Deacon [mailto:w...@kernel.org]
> Sent: Friday, August 28, 2020 10:29 PM
> To: Song Bao Hua (Barry Song) 
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> robin.mur...@arm.com; j...@8bytes.org; Linuxarm 
> Subject: Re: [PATCH] iommu/arm-smmu-v3: add tracepoints for
> cmdq_issue_cmdlist
> 
> On Thu, Aug 27, 2020 at 09:33:51PM +1200, Barry Song wrote:
> > cmdq_issue_cmdlist() is the hotspot that uses a lot of time. This patch
> > adds tracepoints for it to help debug.
> >
> > Signed-off-by: Barry Song 
> > ---
> >  * can furthermore develop an eBPF program to benchmark using this trace
> 
> Hmm, don't these things have a history of becoming ABI? If so, I don't
> really want them in the driver at all, sorry. Do other drivers overcome
> this somehow?

Tracepoints of this kind mainly work as low-overhead probe points for
debugging. I don't think any application would depend on them. And there
are lots of tracepoints in other drivers, even in the iommu core and the
intel_iommu driver :-)

Developers use them in one of the following ways:

1. get trace print from the ring buffer by reading debugfs
root@ubuntu:/sys/kernel/debug/tracing/events/arm_smmu_v3# echo 1 > enable
# cat /sys/kernel/debug/tracing/trace_pipe
<idle>-0 [058] ..s1 125444.768083: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0 [058] ..s1 125444.768084: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0 [058] ..s1 125444.768085: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0 [058] ..s1 125444.768165: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0 [058] ..s1 125444.768168: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0 [058] ..s1 125444.768169: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0 [058] ..s1 125444.768171: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0 [058] ..s1 125444.768259: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
  ...

This can replace printk with much much lower overhead.

2. add a hook function to the tracepoint to measure latency and collect
timing statistics, just like the eBPF example I gave after the commit log.

Using it, I can get the histogram of the execution time of cmdq_issue_cmdlist():
     nsecs          : count
       0 -> 1       : 0
       2 -> 3       : 0
       4 -> 7       : 0
       8 -> 15      : 0
      16 -> 31      : 0
      32 -> 63      : 0
      64 -> 127     : 0
     128 -> 255     : 0
     256 -> 511     : 0
     512 -> 1023    : 58
    1024 -> 2047    : 22763
    2048 -> 4095    : 13238

I feel it is very common to do this kind of thing when analyzing a
performance issue. For example, to ease the analysis of softirq latency,
softirq.c has the code below:

asmlinkage __visible void __softirq_entry __do_softirq(void)
{
...
trace_softirq_entry(vec_nr);
h->action(h);
trace_softirq_exit(vec_nr);
...
}

> 
> Will

Thanks
Barry



RE: [PATCH] iommu/arm-smmu-v3: add tracepoints for cmdq_issue_cmdlist

2020-08-28 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jean-Philippe Brucker [mailto:jean-phili...@linaro.org]
> Sent: Friday, August 28, 2020 7:41 PM
> To: Song Bao Hua (Barry Song) 
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> robin.mur...@arm.com; w...@kernel.org; Linuxarm 
> Subject: Re: [PATCH] iommu/arm-smmu-v3: add tracepoints for
> cmdq_issue_cmdlist
> 
> Hi,
> 
> On Thu, Aug 27, 2020 at 09:33:51PM +1200, Barry Song wrote:
> > cmdq_issue_cmdlist() is the hotspot that uses a lot of time. This
> > patch adds tracepoints for it to help debug.
> >
> > Signed-off-by: Barry Song 
> > ---
> >  * can furthermore develop an eBPF program to benchmark using this
> > trace
> 
> Have you tried using kprobe and kretprobe instead of tracepoints?
> Any noticeable performance drop?

Yes. Please read this email:
kprobe overhead and OPTPROBES implementation on ARM64
https://www.spinics.net/lists/arm-kernel/msg828788.html

> 
> Thanks,
> Jean
> 
> >
> >   cmdlistlat.c:
> > #include 
> >
> > BPF_HASH(start, u32);
> > BPF_HISTOGRAM(dist);
> >
> > TRACEPOINT_PROBE(arm_smmu_v3, issue_cmdlist_entry)
> > {
> >     u32 pid;
> >     u64 ts, *val;
> >
> >     pid = bpf_get_current_pid_tgid();
> >     ts = bpf_ktime_get_ns();
> >     start.update(&pid, &ts);
> >     return 0;
> > }
> >
> > TRACEPOINT_PROBE(arm_smmu_v3, issue_cmdlist_exit)
> > {
> >     u32 pid;
> >     u64 *tsp, delta;
> >
> >     pid = bpf_get_current_pid_tgid();
> >     tsp = start.lookup(&pid);
> >
> >     if (tsp != 0) {
> >         delta = bpf_ktime_get_ns() - *tsp;
> >         dist.increment(bpf_log2l(delta));
> >         start.delete(&pid);
> >     }
> >
> >     return 0;
> > }
> >
> >  cmdlistlat.py:
> > #!/usr/bin/python3
> > #
> > from __future__ import print_function
> > from bcc import BPF
> > from ctypes import c_ushort, c_int, c_ulonglong
> > from time import sleep
> > from sys import argv
> >
> > def usage():
> >     print("USAGE: %s [interval [count]]" % argv[0])
> >     exit()
> >
> > # arguments
> > interval = 5
> > count = -1
> > if len(argv) > 1:
> >     try:
> >         interval = int(argv[1])
> >         if interval == 0:
> >             raise
> >         if len(argv) > 2:
> >             count = int(argv[2])
> >     except: # also catches -h, --help
> >         usage()
> >
> > # load BPF program
> > b = BPF(src_file = "cmdlistlat.c")
> >
> > # header
> > print("Tracing... Hit Ctrl-C to end.")
> >
> > # output
> > loop = 0
> > do_exit = 0
> > while (1):
> >     if count > 0:
> >         loop += 1
> >         if loop > count:
> >             exit()
> >     try:
> >         sleep(interval)
> >     except KeyboardInterrupt:
> >         pass; do_exit = 1
> >
> >     print()
> >     b["dist"].print_log2_hist("nsecs")
> >     b["dist"].clear()
> >     if do_exit:
> >         exit()
> >
> >
> >  drivers/iommu/arm/arm-smmu-v3/Makefile|  1 +
> >  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h | 48
> +++
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  8 
> >  3 files changed, 57 insertions(+)
> >  create mode 100644
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile
> > b/drivers/iommu/arm/arm-smmu-v3/Makefile
> > index 569e24e9f162..dba1087f91f3 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/Makefile
> > +++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
> > @@ -1,2 +1,3 @@
> >  # SPDX-License-Identifier: GPL-2.0
> > +ccflags-y += -I$(src)   # needed for trace events
> >  obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h
> > new file mode 100644
> > index ..29ab96706124
> > --- /dev/null
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h
> > @@ -0,0 +1,48 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (C) 2020 Hisilicon Limited.
>

[PATCH] iommu/arm-smmu-v3: add tracepoints for cmdq_issue_cmdlist

2020-08-27 Thread Barry Song
cmdq_issue_cmdlist() is the hotspot that uses a lot of time. This patch
adds tracepoints for it to help debug.

Signed-off-by: Barry Song 
---
 * can furthermore develop an eBPF program to benchmark using this trace

  cmdlistlat.c:
#include 

BPF_HASH(start, u32);
BPF_HISTOGRAM(dist);

TRACEPOINT_PROBE(arm_smmu_v3, issue_cmdlist_entry)
{
    u32 pid;
    u64 ts, *val;

    pid = bpf_get_current_pid_tgid();
    ts = bpf_ktime_get_ns();
    start.update(&pid, &ts);
    return 0;
}

TRACEPOINT_PROBE(arm_smmu_v3, issue_cmdlist_exit)
{
    u32 pid;
    u64 *tsp, delta;

    pid = bpf_get_current_pid_tgid();
    tsp = start.lookup(&pid);

    if (tsp != 0) {
        delta = bpf_ktime_get_ns() - *tsp;
        dist.increment(bpf_log2l(delta));
        start.delete(&pid);
    }

    return 0;
}

 cmdlistlat.py:
#!/usr/bin/python3
#
from __future__ import print_function
from bcc import BPF
from ctypes import c_ushort, c_int, c_ulonglong
from time import sleep
from sys import argv

def usage():
    print("USAGE: %s [interval [count]]" % argv[0])
    exit()

# arguments
interval = 5
count = -1
if len(argv) > 1:
    try:
        interval = int(argv[1])
        if interval == 0:
            raise
        if len(argv) > 2:
            count = int(argv[2])
    except: # also catches -h, --help
        usage()

# load BPF program
b = BPF(src_file = "cmdlistlat.c")

# header
print("Tracing... Hit Ctrl-C to end.")

# output
loop = 0
do_exit = 0
while (1):
    if count > 0:
        loop += 1
        if loop > count:
            exit()
    try:
        sleep(interval)
    except KeyboardInterrupt:
        pass; do_exit = 1

    print()
    b["dist"].print_log2_hist("nsecs")
    b["dist"].clear()
    if do_exit:
        exit()


 drivers/iommu/arm/arm-smmu-v3/Makefile|  1 +
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h | 48 +++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  8 
 3 files changed, 57 insertions(+)
 create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h

diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
index 569e24e9f162..dba1087f91f3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/Makefile
+++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
+ccflags-y += -I$(src)   # needed for trace events
 obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h
new file mode 100644
index 000000000000..29ab96706124
--- /dev/null
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2020 Hisilicon Limited.
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM arm_smmu_v3
+
+#if !defined(_ARM_SMMU_V3_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _ARM_SMMU_V3_TRACE_H
+
+#include <linux/tracepoint.h>
+
+struct device;
+
+DECLARE_EVENT_CLASS(issue_cmdlist_class,
+   TP_PROTO(struct device *dev, int n, bool sync),
+   TP_ARGS(dev, n, sync),
+
+   TP_STRUCT__entry(
+   __string(device, dev_name(dev))
+   __field(int, n)
+   __field(bool, sync)
+   ),
+   TP_fast_assign(
+   __assign_str(device, dev_name(dev));
+   __entry->n = n;
+   __entry->sync = sync;
+   ),
+   TP_printk("%s cmd number=%d sync=%d",
+   __get_str(device), __entry->n, __entry->sync)
+);
+
+#define DEFINE_ISSUE_CMDLIST_EVENT(name)   \
+DEFINE_EVENT(issue_cmdlist_class, name,\
+   TP_PROTO(struct device *dev, int n, bool sync), \
+   TP_ARGS(dev, n, sync))
+
+DEFINE_ISSUE_CMDLIST_EVENT(issue_cmdlist_entry);
+DEFINE_ISSUE_CMDLIST_EVENT(issue_cmdlist_exit);
+
+#endif /* _ARM_SMMU_V3_TRACE_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE arm-smmu-v3-trace
+#include <trace/define_trace.h>
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7332251dd8cd..e2d7d5f1d234 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -33,6 +33,8 @@
 
 #include 
 
+#include "arm-smmu-v3-trace.h"
+
 /* MMIO registers */
 #define ARM_SMMU_IDR0  0x0
 #define IDR0_ST_LVLGENMASK(28, 27)
@@ -1389,6 +1391,8 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
}, head = llq;
int ret = 0;
 
+   trace_issue_cmdlist_entry(smmu->dev, n, sync);
+
/* 1. Allocate some space in the queue */
  

[PATCH v5 1/3] iommu/arm-smmu-v3: replace symbolic permissions by octal permissions for module parameter

2020-08-27 Thread Barry Song
This fixes the following checkpatch issue:
WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using
octal permissions '0444'.
417: FILE: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:417:
module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);

Reviewed-by: Robin Murphy 
Signed-off-by: Barry Song 
---
 -v5: add Robin's reviewed-by

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207be7ea..eea5f7c6d9ab 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -414,7 +414,7 @@
 #define MSI_IOVA_LENGTH0x10
 
 static bool disable_bypass = 1;
-module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);
+module_param_named(disable_bypass, disable_bypass, bool, 0444);
 MODULE_PARM_DESC(disable_bypass,
"Disable bypass streams such that incoming transactions from devices 
that are not attached to an iommu domain will report an abort back to the 
device and will not be allowed to pass through the SMMU.");
 
-- 
2.27.0




[PATCH v5 0/3] iommu/arm-smmu-v3: permit users to disable msi polling

2020-08-27 Thread Barry Song
Patches 1/3 and 2/3 prepare for patch 3/3, which permits users to disable
MSI-based polling from the command line.

-v5:
  add Robin's reviewed-by

-v4:
  with respect to Robin's comments
  * cleanup the code of the existing module parameter disable_bypass
  * add ARM_SMMU_OPT_MSIPOLL flag. on the other hand, we only need to check
a bit in options rather than two bits in features

Barry Song (3):
  iommu/arm-smmu-v3: replace symbolic permissions by octal permissions
for module parameter
  iommu/arm-smmu-v3: replace module_param_named by module_param for
disable_bypass
  iommu/arm-smmu-v3: permit users to disable msi polling

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

-- 
2.27.0




[PATCH v5 2/3] iommu/arm-smmu-v3: replace module_param_named by module_param for disable_bypass

2020-08-27 Thread Barry Song
Just use module_param() - going out of the way to specify a "different"
name that's identical to the variable name is silly.

Reviewed-by: Robin Murphy 
Signed-off-by: Barry Song 
---
 -v5: add Robin's reviewed-by

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index eea5f7c6d9ab..5b40d535a7c8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -414,7 +414,7 @@
 #define MSI_IOVA_LENGTH0x10
 
 static bool disable_bypass = 1;
-module_param_named(disable_bypass, disable_bypass, bool, 0444);
+module_param(disable_bypass, bool, 0444);
 MODULE_PARM_DESC(disable_bypass,
"Disable bypass streams such that incoming transactions from devices 
that are not attached to an iommu domain will report an abort back to the 
device and will not be allowed to pass through the SMMU.");
 
-- 
2.27.0




[PATCH v5 3/3] iommu/arm-smmu-v3: permit users to disable msi polling

2020-08-27 Thread Barry Song
Polling by MSI isn't necessarily faster than polling by SEV. Tests on
hi1620 show hns3 100G NIC network throughput can improve from 25G to
27G if we disable MSI polling while running 16 netperf threads sending
UDP packets in size 32KB. TX throughput can improve from 7G to 7.7G for
single thread.
The reason for the throughput improvement is that the latency to poll
the completion of CMD_SYNC becomes smaller. After sending a CMD_SYNC
in an empty cmd queue, typically we need to wait for 280ns using MSI
polling. But we only need around 190ns after disabling MSI polling.
This patch provides a command line option so that users can decide to
use MSI polling or not based on their tests.

Reviewed-by: Robin Murphy 
Signed-off-by: Barry Song 
---
 -v5: add Robin's reviewed-by

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5b40d535a7c8..7332251dd8cd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -418,6 +418,11 @@ module_param(disable_bypass, bool, 0444);
 MODULE_PARM_DESC(disable_bypass,
"Disable bypass streams such that incoming transactions from devices 
that are not attached to an iommu domain will report an abort back to the 
device and will not be allowed to pass through the SMMU.");
 
+static bool disable_msipolling;
+module_param(disable_msipolling, bool, 0444);
+MODULE_PARM_DESC(disable_msipolling,
+   "Disable MSI-based polling for CMD_SYNC completion.");
+
 enum pri_resp {
PRI_RESP_DENY = 0,
PRI_RESP_FAIL = 1,
@@ -652,6 +657,7 @@ struct arm_smmu_device {
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
 #define ARM_SMMU_OPT_PAGE0_REGS_ONLY   (1 << 1)
+#define ARM_SMMU_OPT_MSIPOLL   (1 << 2)
u32 options;
 
struct arm_smmu_cmdqcmdq;
@@ -992,8 +998,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct 
arm_smmu_device *smmu,
 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
 * payload, so the write will zero the entire command on that platform.
 */
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY) {
+   if (smmu->options & ARM_SMMU_OPT_MSIPOLL) {
ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
   q->ent_dwords * 8;
}
@@ -1332,8 +1337,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct 
arm_smmu_device *smmu,
 static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
 struct arm_smmu_ll_queue *llq)
 {
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY)
+   if (smmu->options & ARM_SMMU_OPT_MSIPOLL)
return __arm_smmu_cmdq_poll_until_msi(smmu, llq);
 
return __arm_smmu_cmdq_poll_until_consumed(smmu, llq);
@@ -3741,8 +3745,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
if (reg & IDR0_SEV)
smmu->features |= ARM_SMMU_FEAT_SEV;
 
-   if (reg & IDR0_MSI)
+   if (reg & IDR0_MSI) {
smmu->features |= ARM_SMMU_FEAT_MSI;
+   if (coherent && !disable_msipolling)
+   smmu->options |= ARM_SMMU_OPT_MSIPOLL;
+   }
 
if (reg & IDR0_HYP)
smmu->features |= ARM_SMMU_FEAT_HYP;
-- 
2.27.0




[PATCH v8 1/3] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-23 Thread Barry Song
Right now, drivers like ARM SMMU use dma_alloc_coherent() to get
coherent DMA buffers to hold their command queues and page tables. As
there is only one default CMA in the whole system, SMMUs on nodes other
than node0 will get remote memory. This leads to significant latency.

This patch provides per-numa CMA so that drivers like SMMU can get local
memory. Tests show that localizing CMA reduces dma_unmap latency considerably.
For instance, before this patch, the SMMU on node2 has to wait more than
560ns for the completion of CMD_SYNC in an empty command queue; with this
patch, it needs only 240ns.

A positive side effect of this patch would be improving performance even
further for those users who are worried about performance more than DMA
security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all
drivers can get local coherent DMA buffers.

Also, this patch changes the default CONFIG_CMA_AREAS to 19 on NUMA, as
1+CONFIG_CMA_AREAS should be quite enough for most servers on the market
even if they enable both hugetlb_cma and pernuma_cma:
2 numa nodes: 2(hugetlb) + 2(pernuma) + 1(default global cma) = 5
4 numa nodes: 4(hugetlb) + 4(pernuma) + 1(default global cma) = 9
8 numa nodes: 8(hugetlb) + 8(pernuma) + 1(default global cma) = 17

Cc: Randy Dunlap 
Cc: Mike Kravetz 
Cc: Jonathan Cameron 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Ganapatrao Kulkarni 
Cc: Catalin Marinas 
Cc: Nicolas Saenz Julienne 
Cc: Steve Capper 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Signed-off-by: Barry Song 
---
-v8:
 * rename parameter from pernuma_cma to cma_pernuma with respect to the comments
   of Mike Rapoport and Randy Dunlap
 * if both hugetlb_cma and pernuma_cma are enabled, we may need a larger default
   CMA_AREAS. In numa, we set it to 19 based on the discussion with Mike Kravetz

 .../admin-guide/kernel-parameters.txt |  11 ++
 include/linux/dma-contiguous.h|   6 ++
 kernel/dma/Kconfig|  11 ++
 kernel/dma/contiguous.c   | 100 --
 mm/Kconfig|   3 +-
 5 files changed, 120 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdc1f33fd3d1..8291e2e7a99c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -599,6 +599,17 @@
altogether. For more information, see
include/linux/dma-contiguous.h
 
+   cma_pernuma=nn[MG]
+   [ARM64,KNL]
+   Sets the size of kernel per-numa memory area for
+   contiguous memory allocations. A value of 0 disables
+   per-numa CMA altogether. If this option is not
+   specified, the default value is 0.
+   With per-numa CMA enabled, DMA users on node nid will
+   first try to allocate buffer from the pernuma area
+   which is located in node nid. If the allocation fails,
+   they will fall back to the global default memory area.
+
cmo_free_hint=  [PPC] Format: { yes | no }
Specify whether pages are marked as being inactive
when they are freed.  This is used in CMO environments
diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
index 03f8e98e3bcc..fe55e004f1f4 100644
--- a/include/linux/dma-contiguous.h
+++ b/include/linux/dma-contiguous.h
@@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, struct page *page,
 
 #endif
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+void dma_pernuma_cma_reserve(void);
+#else
+static inline void dma_pernuma_cma_reserve(void) { }
+#endif
+
 #endif
 
 #endif
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 847a9d1fa634..0ddfb5510fe4 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -118,6 +118,17 @@ config DMA_CMA
  If unsure, say "n".
 
 if  DMA_CMA
+
+config DMA_PERNUMA_CMA
+   bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
+   default NUMA && ARM64
+   help
+ Enable this option to get pernuma CMA areas so that devices like
+ ARM64 SMMU can get local memory by DMA coherent APIs.
+
+ You can set the size of pernuma CMA by specifying "cma_pernuma=size"
+ on the kernel's command line.
+
 comment "Default contiguous memory area size:"
 
 config CMA_SIZE_MBYTES
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index cff7e60968b9..aa53384fd7dc 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -69,6 +69,19 @@ static int __init early_cma(char *p)
 }
 early_param("cma", early_cma);
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+
+static stru

[PATCH v8 2/3] arm64: mm: reserve per-numa CMA to localize coherent dma buffers

2020-08-23 Thread Barry Song
Right now, smmu is using dma_alloc_coherent() to get memory to hold queues
and tables. Typically, on an ARM64 server, there is a default CMA located at
node0, which could be far away from node2, node3, etc.
With this patch, smmu will get memory from the local numa node to hold command
queues and page tables. That means dma_unmap latency will shrink considerably.
Meanwhile, when iommu.passthrough is on, device drivers which call
dma_alloc_coherent() will also get local memory and avoid the trip between
numa nodes.

Acked-by: Will Deacon 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Cc: Ganapatrao Kulkarni 
Cc: Catalin Marinas 
Cc: Nicolas Saenz Julienne 
Cc: Steve Capper 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Signed-off-by: Barry Song 
---
 arch/arm64/mm/init.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 481d22c32a2e..f1c75957ff3c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -429,6 +429,8 @@ void __init bootmem_init(void)
arm64_hugetlb_cma_reserve();
 #endif
 
+   dma_pernuma_cma_reserve();
+
/*
 * sparse_init() tries to allocate memory from memblock, so must be
 * done after the fixed reservations
-- 
2.27.0




[PATCH v8 3/3] mm: cma: use CMA_MAX_NAME to define the length of cma name array

2020-08-23 Thread Barry Song
CMA_MAX_NAME should be visible to CMA's users, as they might need it to set
the name of CMA areas and avoid hard-coding the size locally.
So this patch moves CMA_MAX_NAME from the local header file to the include/linux
header file and removes the hard-coded size in both hugetlb.c and contiguous.c.

Cc: Mike Kravetz 
Cc: Roman Gushchin 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Andrew Morton 
Signed-off-by: Barry Song 
---
 This patch fixes the magic number issue raised in Will's comment here:
 https://lore.kernel.org/linux-iommu/4ab78767553f48a584217063f6f24...@hisilicon.com/

 include/linux/cma.h | 2 ++
 kernel/dma/contiguous.c | 2 +-
 mm/cma.h| 2 --
 mm/hugetlb.c| 4 ++--
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 6ff79fefd01f..217999c8a762 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -18,6 +18,8 @@
 
 #endif
 
+#define CMA_MAX_NAME 64
+
 struct cma;
 
 extern unsigned long totalcma_pages;
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index aa53384fd7dc..f4c150810fd2 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -119,7 +119,7 @@ void __init dma_pernuma_cma_reserve(void)
 
for_each_online_node(nid) {
int ret;
-   char name[20];
+   char name[CMA_MAX_NAME];
struct cma **cma = &dma_contiguous_pernuma_area[nid];
 
snprintf(name, sizeof(name), "pernuma%d", nid);
diff --git a/mm/cma.h b/mm/cma.h
index 20f6e24bc477..42ae082cb067 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -4,8 +4,6 @@
 
 #include 
 
-#define CMA_MAX_NAME 64
-
 struct cma {
unsigned long   base_pfn;
unsigned long   count;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a301c2d672bf..9eec0ea9ba68 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5683,12 +5683,12 @@ void __init hugetlb_cma_reserve(int order)
reserved = 0;
for_each_node_state(nid, N_ONLINE) {
int res;
-   char name[20];
+   char name[CMA_MAX_NAME];
 
size = min(per_node, hugetlb_cma_size - reserved);
size = round_up(size, PAGE_SIZE << order);
 
-   snprintf(name, 20, "hugetlb%d", nid);
+   snprintf(name, sizeof(name), "hugetlb%d", nid);
res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
 0, false, name,
 &hugetlb_cma[nid], nid);
-- 
2.27.0
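
The reason for replacing the literal 20 with sizeof(name) can be illustrated
with a small userspace sketch. The struct and helper names below are
hypothetical and exist only for illustration; CMA_MAX_NAME and its value 64
come from the diff above.

```c
#include <assert.h>
#include <stdio.h>

#define CMA_MAX_NAME 64 /* value taken from the diff above */

/* Hypothetical wrapper so that sizeof() sees the real array, not a pointer. */
struct sketch_cma_name {
	char buf[CMA_MAX_NAME];
};

/*
 * Formats "<prefix><nid>" into the buffer. Because the bound is derived
 * from the array itself, changing CMA_MAX_NAME can never leave a stale
 * length behind at the call site.
 * Returns the number of characters snprintf() wanted to write.
 */
static int sketch_format_name(struct sketch_cma_name *n, const char *prefix,
			      int nid)
{
	assert(n != NULL && prefix != NULL);
	return snprintf(n->buf, sizeof(n->buf), "%s%d", prefix, nid);
}
```

With a hard-coded 20 the buffer size and the snprintf() bound are two separate
numbers that have to be kept in sync by hand; with sizeof() there is only one.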




[PATCH v8 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-23 Thread Barry Song
Ganapatrao Kulkarni has put some effort into making arm-smmu-v3 use local
memory to hold command queues[1]. I also did a similar job in the patch
"iommu/arm-smmu-v3: allocate the memory of queues in local numa node"
[2], not realizing Ganapatrao had done that before.

But it seems much better to make dma_alloc_coherent() inherently
NUMA-aware on NUMA-capable systems.

Right now, smmu is using dma_alloc_coherent() to get memory to hold queues
and tables. Typically, on an ARM64 server, there is a default CMA located at
node0, which could be far away from node2, node3, etc.
Keeping queues and tables in remote memory increases the latency of ARM SMMU
significantly. For example, when the SMMU is at node2 and the default global
CMA is at node0, after sending a CMD_SYNC in an empty command queue, we
have to wait more than 550ns for the completion of the CMD_SYNC command.
However, if we keep them locally, we only need to wait for 240ns.

With per-numa CMA, smmu will get memory from the local numa node to hold command
queues and page tables. That means dma_unmap latency will shrink considerably.

Meanwhile, when iommu.passthrough is on, device drivers which call
dma_alloc_coherent() will also get local memory and avoid the trip between
numa nodes.
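
The allocation policy described here (try the area on the device's own node
first, then fall back to the global default area) can be sketched in userspace
C as below. This is an illustration under stated assumptions, not the code in
kernel/dma/contiguous.c: every sketch_ name is hypothetical, and a CMA area is
modelled as a simple page budget.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 4

/* Hypothetical stand-in for a CMA area: just a count of free pages. */
struct sketch_cma {
	size_t free_pages;
};

static struct sketch_cma sketch_default_area = { .free_pages = 1024 };
/* One optional area per node; NULL means no per-numa area was reserved. */
static struct sketch_cma *sketch_pernuma_area[SKETCH_MAX_NODES];

/* Returns 0 on success, -1 if the area is absent or too small. */
static int sketch_cma_alloc(struct sketch_cma *cma, size_t count)
{
	if (!cma || cma->free_pages < count)
		return -1;
	cma->free_pages -= count;
	return 0;
}

/*
 * Try the per-numa area of node nid first; if that fails, fall back to
 * the global default area. Returns the area that served the request,
 * or NULL if neither could.
 */
static struct sketch_cma *sketch_alloc_contiguous(int nid, size_t count)
{
	assert(count > 0);

	if (nid >= 0 && nid < SKETCH_MAX_NODES &&
	    sketch_cma_alloc(sketch_pernuma_area[nid], count) == 0)
		return sketch_pernuma_area[nid];

	if (sketch_cma_alloc(&sketch_default_area, count) == 0)
		return &sketch_default_area;

	return NULL;
}
```

A device on a node without a reserved area, or one whose local area is
exhausted, behaves exactly as it did before the series: it is served from the
default area.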

[1] https://lists.linuxfoundation.org/pipermail/iommu/2017-October/024455.html
[2] https://www.spinics.net/lists/iommu/msg44767.html

-v8:
 * rename parameter from pernuma_cma to cma_pernuma with respect to the comments
   of Mike Rapoport and Randy Dunlap
 * if both hugetlb_cma and pernuma_cma are enabled, we may need a larger default
   CMA_AREAS. In numa, we set it to 19 based on the discussion with Mike Kravetz

-v7:
 * add Will's acked-by
 * some cleanup with respect to Will's comments
 * add patch 3/3 to remove the hardcode of defining the size of cma name.
   this patch requires some header file change in include/linux

-v6:
 * rebase on top of 5.9-rc1
 * doc cleanup

-v5:
 refine code according to Christoph Hellwig's comments
 * remove Kconfig option for pernuma cma size;
 * add Kconfig option for pernuma cma enable;
 * code cleanup like line over 80 char

 I haven't removed the cma NULL check code in cma_alloc() as it requires
 a bundle of other changes. So I prefer to handle this issue separately.

-v4:
 * rebase on top of Christoph Hellwig's patch:
 [PATCH v2] dma-contiguous: cleanup dma_alloc_contiguous
 https://lore.kernel.org/linux-iommu/20200723120133.94105-1-...@lst.de/
 * cleanup according to Christoph's comment
 * rebase on top of linux-next to avoid arch/arm64 conflicts
 * reserve cma by checking N_MEMORY rather than N_ONLINE

-v3:
  * move to use page_to_nid() while freeing cma with respect to Robin's
  comment, but this will only work after applying my below patch:
  "mm/cma.c: use exact_nid true to fix possible per-numa cma leak"
  https://marc.info/?l=linux-mm=159333034726647=2

  * handle the case count <= 1 more properly according to Robin's
  comment;

  * add pernuma_cma parameter to support dynamic setting of per-numa
  cma size;
  ideally we can leverage the CMA_SIZE_MBYTES, CMA_SIZE_PERCENTAGE and
  "cma=" kernel parameter and avoid a separate new parameter for
  per-numa cma. Practically, it is really too complicated considering the
  below problems:
  (1) if we leverage the size of default numa for per-numa, we have to
  avoid creating two cma with same size in node0 since default cma is
  probably on node0.
  (2) default cma can consider the address limitation for old devices
  while per-numa cma doesn't support GFP_DMA and GFP_DMA32. all
  allocations with limitation flags will fallback to default one.
  (3) hard to apply CMA_SIZE_PERCENTAGE to per-numa. it is hard to
  decide if the percentage should apply to the whole memory size
  or only apply to the memory size of a specific numa node.
  (4) default cma size has CMA_SIZE_SEL_MIN and CMA_SIZE_SEL_MAX, it
  makes things even more complicated to per-numa cma.

  I haven't figured out a good way to leverage the size of default cma
  for per-numa cma. it seems a separate parameter for per-numa could
  make life easier.

  * move dma_pernuma_cma_reserve() after hugetlb_cma_reserve() to
  reuse the comment before hugetlb_cma_reserve() with respect to
  Robin's comment

-v2: 
  * fix some issues reported by kernel test robot
  * fallback to default cma while allocation fails in per-numa cma
 free memory properly
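
As an aside on the size parameter discussed in the changelog (documented in
patch 1/3 as cma_pernuma=nn[MG]), a value of that form can be parsed as in the
userspace sketch below. The kernel has its own helper for such values,
memparse(); whether this series uses it is not visible in the truncated diff,
so the function here is a hypothetical approximation that handles only the M
and G suffixes named in the documentation text.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical userspace approximation of "nn[MG]" parsing.
 * A missing suffix means a plain byte count; a result of 0 would
 * correspond to "per-numa CMA disabled" in the documentation text.
 */
static unsigned long long sketch_parse_size(const char *s)
{
	char *end;
	unsigned long long val;

	assert(s != NULL);

	val = strtoull(s, &end, 0);
	switch (*end) {
	case 'G':
	case 'g':
		val <<= 30;
		break;
	case 'M':
	case 'm':
		val <<= 20;
		break;
	default:
		break; /* no recognised suffix: value is in bytes */
	}
	return val;
}
```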

Barry Song (3):
  dma-contiguous: provide the ability to reserve per-numa CMA
  arm64: mm: reserve per-numa CMA to localize coherent dma buffers
  mm: cma: use CMA_MAX_NAME to define the length of cma name array

 .../admin-guide/kernel-parameters.txt |  11 ++
 arch/arm64/mm/init.c  |   2 +
 include/linux/cma.h   |   2 +
 include/linux/dma-contiguous.h|   6 ++
 kernel/dma/Kconfig|  11 ++
 kernel/dma/contiguous.c   | 1

RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Saturday, August 22, 2020 7:27 AM
> To: 'Mike Kravetz' ; h...@lst.de;
> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> a...@linux-foundation.org
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Linuxarm 
> Subject: RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
> per-NUMA CMA
> 
> 
> 
> > -Original Message-
> > From: Mike Kravetz [mailto:mike.krav...@oracle.com]
> > Sent: Saturday, August 22, 2020 5:53 AM
> > To: Song Bao Hua (Barry Song) ; h...@lst.de;
> > m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> > a...@linux-foundation.org
> > Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> > linux-ker...@vger.kernel.org; Zengtao (B) ;
> > huangdaode ; Linuxarm
> 
> > Subject: Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
> > per-NUMA CMA
> >
> > Hi Barry,
> > Sorry for jumping in so late.
> >
> > On 8/21/20 4:33 AM, Barry Song wrote:
> > >
> > > with per-numa CMA, smmu will get memory from local numa node to save
> > command
> > > queues and page tables. that means dma_unmap latency will be shrunk
> > much.
> >
> > Since per-node CMA areas for hugetlb was introduced, I have been thinking
> > about the limited number of CMA areas.  In most configurations, I believe
> > it is limited to 7.  And, IIRC it is not something that can be changed at
> > runtime, you need to reconfig and rebuild to increase the number.  In
> contrast
> > some configs have NODES_SHIFT set to 10.  I wasn't too worried because of
> > the limited hugetlb use case.  However, this series is adding another user
> > of per-node CMA areas.
> >
> > With more users, should try to sync up number of CMA areas and number of
> > nodes?  Or, perhaps I am worrying about nothing?
> 
> Hi Mike,
> The current limitation is 8. If the server has 4 nodes and we enable both
> pernuma
> CMA and hugetlb, the last node will fail to get one cma area as the default
> global cma area will take 1 of 8. So users need to change menuconfig.
> If the server has 8 nodes, we enable one of pernuma cma and hugetlb, one
> node
> will fail to get cma.
> 
> We may set the default number of CMA areas as 8+MAX_NODES(if hugetlb
> enabled) +
> MAX_NODES(if pernuma cma enabled) if we don't expect users to change
> config, but
> right now hugetlb has not an option in Kconfig to enable or disable like
> pernuma cma
> has DMA_PERNUMA_CMA.

I would prefer we make some changes like:

config CMA_AREAS
int "Maximum count of the CMA areas"
depends on CMA
+   default 19 if NUMA
default 7
help
  CMA allows to create CMA areas for particular purpose, mainly,
  used as device private area. This parameter sets the maximum
  number of CMA area in the system.

- If unsure, leave the default value "7".
+ If unsure, leave the default value "7" or "19" if NUMA is used.

1 + CONFIG_CMA_AREAS should be quite enough for almost all servers on the
market.

With 2 numa nodes and both hugetlb cma and pernuma cma enabled, we need 2*2 + 1 = 5.
With 4 numa nodes and both hugetlb cma and pernuma cma enabled, we need 2*4 + 1 = 9 -> default ARM64 config.
With 8 numa nodes and both hugetlb cma and pernuma cma enabled, we need 2*8 + 1 = 17.

The default value supports the most common cases and is not going to support
those servers with NODES_SHIFT=10; they can make their own config, just as
users need to increase CMA_AREAS if they add many cma areas in the device tree
of a system even without NUMA.
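
The arithmetic above can be written down as a small userspace C sketch. The
helper names are hypothetical and not kernel APIs; the only assumption carried
over from this thread is that the system can hold 1 + CONFIG_CMA_AREAS areas in
total.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: number of CMA areas needed when each NUMA node
 * may get one hugetlb area and one per-numa area, plus the single
 * default global area.
 */
static int sketch_cma_areas_needed(int nr_nodes, bool hugetlb_cma,
				   bool pernuma_cma)
{
	assert(nr_nodes > 0);

	return nr_nodes * (hugetlb_cma ? 1 : 0) +
	       nr_nodes * (pernuma_cma ? 1 : 0) +
	       1; /* default global cma */
}

/* True if the requirement fits within 1 + CONFIG_CMA_AREAS slots. */
static bool sketch_cma_areas_fit(int needed, int config_cma_areas)
{
	return needed <= 1 + config_cma_areas;
}
```

With the old default of 7, a 4-node machine with both features enabled needs 9
areas and does not fit in 8; with the proposed default of 19, even the 8-node
case (17 areas) fits.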

What do you think, Mike?

Thanks
Barry


RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Mike Kravetz [mailto:mike.krav...@oracle.com]
> Sent: Saturday, August 22, 2020 5:53 AM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> a...@linux-foundation.org
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Linuxarm 
> Subject: Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
> per-NUMA CMA
> 
> Hi Barry,
> Sorry for jumping in so late.
> 
> On 8/21/20 4:33 AM, Barry Song wrote:
> >
> > with per-numa CMA, smmu will get memory from local numa node to save
> command
> > queues and page tables. that means dma_unmap latency will be shrunk
> much.
> 
> Since per-node CMA areas for hugetlb was introduced, I have been thinking
> about the limited number of CMA areas.  In most configurations, I believe
> it is limited to 7.  And, IIRC it is not something that can be changed at
> runtime, you need to reconfig and rebuild to increase the number.  In contrast
> some configs have NODES_SHIFT set to 10.  I wasn't too worried because of
> the limited hugetlb use case.  However, this series is adding another user
> of per-node CMA areas.
> 
> With more users, should try to sync up number of CMA areas and number of
> nodes?  Or, perhaps I am worrying about nothing?

Hi Mike,
The current limit is 8. If the server has 4 nodes and we enable both pernuma
CMA and hugetlb, the last node will fail to get a cma area, as the default
global cma area takes 1 of the 8. So users need to change menuconfig.
If the server has 8 nodes and we enable either pernuma cma or hugetlb, one node
will fail to get cma.

We may set the default number of CMA areas to 8+MAX_NODES(if hugetlb enabled) +
MAX_NODES(if pernuma cma enabled) if we don't expect users to change config, but
right now hugetlb does not have a Kconfig option to enable or disable it the way
pernuma cma has DMA_PERNUMA_CMA.

> --
> Mike Kravetz

Thanks
Barry


RE: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Randy Dunlap [mailto:rdun...@infradead.org]
> Sent: Saturday, August 22, 2020 4:08 AM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> a...@linux-foundation.org
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Linuxarm ;
> Jonathan Cameron ; Nicolas Saenz Julienne
> ; Steve Capper ; Mike
> Rapoport 
> Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On 8/21/20 4:33 AM, Barry Song wrote:
> > ---
> >  -v7: with respect to Will's comments
> >  * move to use for_each_online_node
> >  * add description if users don't specify pernuma_cma
> >  * provide default value for CONFIG_DMA_PERNUMA_CMA
> >
> >  .../admin-guide/kernel-parameters.txt |  11 ++
> >  include/linux/dma-contiguous.h|   6 ++
> >  kernel/dma/Kconfig|  11 ++
> >  kernel/dma/contiguous.c   | 100
> --
> >  4 files changed, 118 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..c609527fc35a 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,17 @@
> > altogether. For more information, see
> > include/linux/dma-contiguous.h
> >
> > +   pernuma_cma=nn[MG]
> > +   [ARM64,KNL]
> > +   Sets the size of kernel per-numa memory area for
> > +   contiguous memory allocations. A value of 0 disables
> > +   per-numa CMA altogether. And If this option is not
> > +   specificed, the default value is 0.
> > +   With per-numa CMA enabled, DMA users on node nid will
> > +   first try to allocate buffer from the pernuma area
> > +   which is located in node nid, if the allocation fails,
> > +   they will fallback to the global default memory area.
> > +
> 
> Entries in kernel-parameters.txt are supposed to be in alphabetical order
> but this one is not.  If you want to keep it near the cma= entry, you can
> rename it like Mike suggested.  Otherwise it needs to be moved.

As I replied to Mike's comment, I'd like to rename it to cma_per...

> 
> 
> > cmo_free_hint=  [PPC] Format: { yes | no }
> > Specify whether pages are marked as being inactive
> > when they are freed.  This is used in CMO environments
> 
> 
> 
> --
> ~Randy

Thanks
Barry



RE: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Mike Rapoport [mailto:r...@linux.ibm.com]
> Sent: Saturday, August 22, 2020 2:28 AM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; a...@linux-foundation.org;
> iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Linuxarm ;
> Jonathan Cameron ; Nicolas Saenz Julienne
> ; Steve Capper 
> Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On Fri, Aug 21, 2020 at 11:33:53PM +1200, Barry Song wrote:
> > Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get
> > coherent DMA buffers to save their command queues and page tables. As
> > there is only one default CMA in the whole system, SMMUs on nodes other
> > than node0 will get remote memory. This leads to significant latency.
> >
> > This patch provides per-numa CMA so that drivers like SMMU can get local
> > memory. Tests show localizing CMA can decrease dma_unmap latency much.
> > For instance, before this patch, SMMU on node2  has to wait for more than
> > 560ns for the completion of CMD_SYNC in an empty command queue; with
> this
> > patch, it needs 240ns only.
> >
> > A positive side effect of this patch would be improving performance even
> > further for those users who are worried about performance more than DMA
> > security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all
> > drivers can get local coherent DMA buffers.
> >
> > Cc: Jonathan Cameron 
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Will Deacon 
> > Cc: Robin Murphy 
> > Cc: Ganapatrao Kulkarni 
> > Cc: Catalin Marinas 
> > Cc: Nicolas Saenz Julienne 
> > Cc: Steve Capper 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Signed-off-by: Barry Song 
> > ---
> >  -v7: with respect to Will's comments
> >  * move to use for_each_online_node
> >  * add description if users don't specify pernuma_cma
> >  * provide default value for CONFIG_DMA_PERNUMA_CMA
> >
> >  .../admin-guide/kernel-parameters.txt |  11 ++
> >  include/linux/dma-contiguous.h|   6 ++
> >  kernel/dma/Kconfig|  11 ++
> >  kernel/dma/contiguous.c   | 100
> --
> >  4 files changed, 118 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..c609527fc35a 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,17 @@
> > altogether. For more information, see
> > include/linux/dma-contiguous.h
> >
> > +   pernuma_cma=nn[MG]
> 
> Maybe cma_pernuma or cma_pernode?

Sounds good.

> 
> > +   [ARM64,KNL]
> > +   Sets the size of kernel per-numa memory area for
> > +   contiguous memory allocations. A value of 0 disables
> > +   per-numa CMA altogether. And If this option is not
> > +   specificed, the default value is 0.
> > +   With per-numa CMA enabled, DMA users on node nid will
> > +   first try to allocate buffer from the pernuma area
> > +   which is located in node nid, if the allocation fails,
> > +   they will fallback to the global default memory area.
> > +
> > cmo_free_hint=  [PPC] Format: { yes | no }
> > Specify whether pages are marked as being inactive
> > when they are freed.  This is used in CMO environments
> > diff --git a/include/linux/dma-contiguous.h
> b/include/linux/dma-contiguous.h
> > index 03f8e98e3bcc..fe55e004f1f4 100644
> > --- a/include/linux/dma-contiguous.h
> > +++ b/include/linux/dma-contiguous.h
> > @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct
> device *dev, struct page *page,
> >
> >  #endif
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +void dma_pernuma_cma_reserve(void);
> > +#else
> > +static inline void dma_pernuma_cma_reserve(void) { }
> > +#endif
> > +
> >  #endif
> >
> >  #endif
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index 847a9d1fa634..c38979d45b13 100644
> &

[PATCH v7 2/3] arm64: mm: reserve per-numa CMA to localize coherent dma buffers

2020-08-21 Thread Barry Song
Right now, smmu is using dma_alloc_coherent() to get memory to hold queues
and tables. Typically, on an ARM64 server, there is a default CMA located at
node0, which could be far away from node2, node3, etc.
With this patch, smmu will get memory from the local numa node to hold command
queues and page tables. That means dma_unmap latency will shrink considerably.
Meanwhile, when iommu.passthrough is on, device drivers which call
dma_alloc_coherent() will also get local memory and avoid the trip between
numa nodes.

Acked-by: Will Deacon 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Cc: Ganapatrao Kulkarni 
Cc: Catalin Marinas 
Cc: Nicolas Saenz Julienne 
Cc: Steve Capper 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Signed-off-by: Barry Song 
---
 -v7: add Will's acked-by

 arch/arm64/mm/init.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 481d22c32a2e..f1c75957ff3c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -429,6 +429,8 @@ void __init bootmem_init(void)
arm64_hugetlb_cma_reserve();
 #endif
 
+   dma_pernuma_cma_reserve();
+
/*
 * sparse_init() tries to allocate memory from memblock, so must be
 * done after the fixed reservations
-- 
2.27.0




[PATCH v7 3/3] mm: cma: use CMA_MAX_NAME to define the length of cma name array

2020-08-21 Thread Barry Song
CMA_MAX_NAME should be visible to CMA's users, as they might need it to set
the name of CMA areas and avoid hard-coding the size locally.
So this patch moves CMA_MAX_NAME from the local header file to the include/linux
header file and removes the magic number in hugetlb.c and contiguous.c.

Cc: Mike Kravetz 
Cc: Roman Gushchin 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Andrew Morton 
Signed-off-by: Barry Song 
---
 This patch fixes the magic number issue raised in Will's comment here:
 https://lore.kernel.org/linux-iommu/4ab78767553f48a584217063f6f24...@hisilicon.com/

 include/linux/cma.h | 2 ++
 kernel/dma/contiguous.c | 2 +-
 mm/cma.h| 2 --
 mm/hugetlb.c| 4 ++--
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 6ff79fefd01f..217999c8a762 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -18,6 +18,8 @@
 
 #endif
 
+#define CMA_MAX_NAME 64
+
 struct cma;
 
 extern unsigned long totalcma_pages;
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 0383c9b86715..d2d6b715c274 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -119,7 +119,7 @@ void __init dma_pernuma_cma_reserve(void)
 
for_each_online_node(nid) {
int ret;
-   char name[20];
+   char name[CMA_MAX_NAME];
struct cma **cma = &dma_contiguous_pernuma_area[nid];
 
snprintf(name, sizeof(name), "pernuma%d", nid);
diff --git a/mm/cma.h b/mm/cma.h
index 20f6e24bc477..42ae082cb067 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -4,8 +4,6 @@
 
 #include 
 
-#define CMA_MAX_NAME 64
-
 struct cma {
unsigned long   base_pfn;
unsigned long   count;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a301c2d672bf..9eec0ea9ba68 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5683,12 +5683,12 @@ void __init hugetlb_cma_reserve(int order)
reserved = 0;
for_each_node_state(nid, N_ONLINE) {
int res;
-   char name[20];
+   char name[CMA_MAX_NAME];
 
size = min(per_node, hugetlb_cma_size - reserved);
size = round_up(size, PAGE_SIZE << order);
 
-   snprintf(name, 20, "hugetlb%d", nid);
+   snprintf(name, sizeof(name), "hugetlb%d", nid);
res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
 0, false, name,
 &hugetlb_cma[nid], nid);
-- 
2.27.0




[PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Barry Song
Right now, drivers like ARM SMMU use dma_alloc_coherent() to get
coherent DMA buffers to hold their command queues and page tables. As
there is only one default CMA in the whole system, SMMUs on nodes other
than node0 will get remote memory. This leads to significant latency.

This patch provides per-numa CMA so that drivers like SMMU can get local
memory. Tests show that localizing CMA reduces dma_unmap latency considerably.
For instance, before this patch, the SMMU on node2 has to wait more than
560ns for the completion of CMD_SYNC in an empty command queue; with this
patch, it needs only 240ns.

A positive side effect of this patch would be improving performance even
further for those users who are worried about performance more than DMA
security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all
drivers can get local coherent DMA buffers.

Cc: Jonathan Cameron 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Ganapatrao Kulkarni 
Cc: Catalin Marinas 
Cc: Nicolas Saenz Julienne 
Cc: Steve Capper 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Signed-off-by: Barry Song 
---
 -v7: with respect to Will's comments
 * move to use for_each_online_node
 * add description if users don't specify pernuma_cma
 * provide default value for CONFIG_DMA_PERNUMA_CMA

 .../admin-guide/kernel-parameters.txt |  11 ++
 include/linux/dma-contiguous.h|   6 ++
 kernel/dma/Kconfig|  11 ++
 kernel/dma/contiguous.c   | 100 --
 4 files changed, 118 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index bdc1f33fd3d1..c609527fc35a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -599,6 +599,17 @@
altogether. For more information, see
include/linux/dma-contiguous.h
 
+   pernuma_cma=nn[MG]
+   [ARM64,KNL]
+   Sets the size of kernel per-numa memory area for
+   contiguous memory allocations. A value of 0 disables
+   per-numa CMA altogether. If this option is not
+   specified, the default value is 0.
+   With per-numa CMA enabled, DMA users on node nid will
+   first try to allocate buffer from the pernuma area
+   which is located in node nid; if the allocation fails,
+   they will fall back to the global default memory area.
+
cmo_free_hint=  [PPC] Format: { yes | no }
Specify whether pages are marked as being inactive
when they are freed.  This is used in CMO environments
diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
index 03f8e98e3bcc..fe55e004f1f4 100644
--- a/include/linux/dma-contiguous.h
+++ b/include/linux/dma-contiguous.h
@@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, 
struct page *page,
 
 #endif
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+void dma_pernuma_cma_reserve(void);
+#else
+static inline void dma_pernuma_cma_reserve(void) { }
+#endif
+
 #endif
 
 #endif
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 847a9d1fa634..c38979d45b13 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -118,6 +118,17 @@ config DMA_CMA
  If unsure, say "n".
 
 if  DMA_CMA
+
+config DMA_PERNUMA_CMA
+   bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
+   default NUMA && ARM64
+   help
+ Enable this option to get pernuma CMA areas so that devices like
+ ARM64 SMMU can get local memory by DMA coherent APIs.
+
+ You can set the size of pernuma CMA by specifying "pernuma_cma=size"
+ on the kernel's command line.
+
 comment "Default contiguous memory area size:"
 
 config CMA_SIZE_MBYTES
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index cff7e60968b9..0383c9b86715 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -69,6 +69,19 @@ static int __init early_cma(char *p)
 }
 early_param("cma", early_cma);
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+
+static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
+static phys_addr_t pernuma_size_bytes __initdata;
+
+static int __init early_pernuma_cma(char *p)
+{
+   pernuma_size_bytes = memparse(p, &p);
+   return 0;
+}
+early_param("pernuma_cma", early_pernuma_cma);
+#endif
+
 #ifdef CONFIG_CMA_SIZE_PERCENTAGE
 
 static phys_addr_t __init __maybe_unused cma_early_percent_memory(void)
@@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t 
cma_early_percent_memory(void)
 
 #endif
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+void __init dma_pernuma_cma_reserve(void)
+{
+ 

[PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-21 Thread Barry Song
Ganapatrao Kulkarni has put some effort into making arm-smmu-v3 use local
memory to save command queues[1]. I also did a similar job in the patch
"iommu/arm-smmu-v3: allocate the memory of queues in local numa node"
[2], not realizing Ganapatrao had done that before.

But it seems much better to make dma_alloc_coherent() inherently
NUMA-aware on NUMA-capable systems.

Right now, smmu is using dma_alloc_coherent() to get memory to save queues
and tables. Typically, on an ARM64 server, there is a default CMA located at
node0, which could be far away from node2, node3, etc.
Saving queues and tables remotely will increase the latency of ARM SMMU
significantly. For example, when the SMMU is at node2 and the default global
CMA is at node0, after sending a CMD_SYNC in an empty command queue, we
have to wait more than 550ns for the completion of the command CMD_SYNC.
However, if we save them locally, we only need to wait for 240ns.

With per-numa CMA, smmu will get memory from the local numa node to save
command queues and page tables. That means dma_unmap latency will shrink
considerably.

Meanwhile, when iommu.passthrough is on, device drivers which call
dma_alloc_coherent() will also get local memory and avoid the travel between
numa nodes.

[1] https://lists.linuxfoundation.org/pipermail/iommu/2017-October/024455.html
[2] https://www.spinics.net/lists/iommu/msg44767.html


-v7:
 * add Will's acked-by for the change in arch/arm64
 * some cleanup with respect to Will's comments
 * add patch 3/3 to remove the hard-coded size of the cma name array;
   this patch requires some header file changes in include/linux

-v6:
 * rebase on top of 5.9-rc1
 * doc cleanup

-v5:
 refine code according to Christoph Hellwig's comments
 * remove Kconfig option for pernuma cma size;
 * add Kconfig option for pernuma cma enable;
 * code cleanup like line over 80 char

 I haven't removed the cma NULL check code in cma_alloc() as it requires
 a bundle of other changes. So I prefer to handle this issue separately.

-v4:
 * rebase on top of Christoph Hellwig's patch:
 [PATCH v2] dma-contiguous: cleanup dma_alloc_contiguous
 https://lore.kernel.org/linux-iommu/20200723120133.94105-1-...@lst.de/
 * cleanup according to Christoph's comment
 * rebase on top of linux-next to avoid arch/arm64 conflicts
 * reserve cma by checking N_MEMORY rather than N_ONLINE

-v3:
  * move to use page_to_nid() while freeing cma with respect to Robin's
  comment, but this will only work after applying my below patch:
  "mm/cma.c: use exact_nid true to fix possible per-numa cma leak"
  https://marc.info/?l=linux-mm=159333034726647=2

  * handle the case count <= 1 more properly according to Robin's
  comment;

  * add pernuma_cma parameter to support dynamic setting of per-numa
  cma size;
  ideally we can leverage the CMA_SIZE_MBYTES, CMA_SIZE_PERCENTAGE and
  "cma=" kernel parameter and avoid a new parameter separately for
  per-numa cma. Practically, it is really too complicated considering the
  below problems:
  (1) if we leverage the size of default numa for per-numa, we have to
  avoid creating two cma with same size in node0 since default cma is
  probably on node0.
  (2) default cma can consider the address limitation for old devices
  while per-numa cma doesn't support GFP_DMA and GFP_DMA32. all
  allocations with limitation flags will fallback to default one.
  (3) hard to apply CMA_SIZE_PERCENTAGE to per-numa. it is hard to
  decide if the percentage should apply to the whole memory size
  or only apply to the memory size of a specific numa node.
  (4) default cma size has CMA_SIZE_SEL_MIN and CMA_SIZE_SEL_MAX, it
  makes things even more complicated to per-numa cma.

  I haven't figured out a good way to leverage the size of default cma
  for per-numa cma. it seems a separate parameter for per-numa could
  make life easier.

  * move dma_pernuma_cma_reserve() after hugetlb_cma_reserve() to
  reuse the comment before hugetlb_cma_reserve() with respect to
  Robin's comment

-v2: 
  * fix some issues reported by kernel test robot
  * fallback to default cma while allocation fails in per-numa cma
 free memory properly

Barry Song (3):
  dma-contiguous: provide the ability to reserve per-numa CMA
  arm64: mm: reserve per-numa CMA to localize coherent dma buffers
  mm: cma: use CMA_MAX_NAME to define the length of cma name array

 .../admin-guide/kernel-parameters.txt |  11 ++
 arch/arm64/mm/init.c  |   2 +
 include/linux/cma.h   |   2 +
 include/linux/dma-contiguous.h|   6 ++
 kernel/dma/Kconfig|  11 ++
 kernel/dma/contiguous.c   | 100 --
 mm/cma.h  |   2 -
 mm/hugetlb.c  |   4 +-
 8 files changed, 124 insertions(+), 14 deletions(-)

-- 
2.27.0



RE: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Will Deacon [mailto:w...@kernel.org]
> Sent: Friday, August 21, 2020 9:27 PM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> iommu@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> huangdaode ; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On Fri, Aug 21, 2020 at 09:13:39AM +, Song Bao Hua (Barry Song) wrote:
> >
> >
> > > -Original Message-
> > > From: Will Deacon [mailto:w...@kernel.org]
> > > Sent: Friday, August 21, 2020 8:47 PM
> > > To: Song Bao Hua (Barry Song) 
> > > Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> > > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> > > iommu@lists.linux-foundation.org; Linuxarm ;
> > > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> > > huangdaode ; Jonathan Cameron
> > > ; Nicolas Saenz Julienne
> > > ; Steve Capper ;
> > > Andrew Morton ; Mike Rapoport
> > > 
> > > Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to
> > > reserve per-numa CMA
> > >
> > > On Fri, Aug 21, 2020 at 02:26:14PM +1200, Barry Song wrote:
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> > > b/Documentation/admin-guide/kernel-parameters.txt
> > > > index bdc1f33fd3d1..3f33b89aeab5 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -599,6 +599,15 @@
> > > > altogether. For more information, see
> > > > include/linux/dma-contiguous.h
> > > >
> > > > +   pernuma_cma=nn[MG]
> > > > +   [ARM64,KNL]
> > > > +   Sets the size of kernel per-numa memory area for
> > > > +   contiguous memory allocations. A value of 0 disables
> > > > +   per-numa CMA altogether. DMA users on node nid will
> > > > +   first try to allocate buffer from the pernuma area
> > > > +   which is located in node nid, if the allocation fails,
> > > > +   they will fallback to the global default memory area.
> > >
> > > What is the default behaviour if this option is not specified? Seems
> > > like that should be mentioned here.
> 
> Just wanted to make sure you didn't miss this ^^

If it is not specified, the default size is 0, which means pernuma_cma is
disabled.

Will put some words for this.
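
To spell out the behaviour: with the default size of 0, no per-numa area
exists and every request goes to the global default area; otherwise the
node-local area is tried first and the default area is the fallback. Below
is a minimal userspace sketch of that policy. It is not the kernel code:
struct area, pernuma_area[], default_area, area_alloc() and alloc_on_node()
are all hypothetical names used only for illustration.

```c
#include <stddef.h>

#define MAX_NODES 4	/* hypothetical node count for this sketch */

/* Hypothetical model of a contiguous area: just a free-page counter. */
struct area {
	size_t free_pages;
};

struct area default_area = { .free_pages = 8 };
struct area *pernuma_area[MAX_NODES];	/* all NULL when pernuma_cma=0 */

/* Take pages from one area; returns 1 on success, 0 on failure. */
int area_alloc(struct area *a, size_t count)
{
	if (!a || a->free_pages < count)
		return 0;
	a->free_pages -= count;
	return 1;
}

/* Returns the area that satisfied the request, or NULL if none could. */
struct area *alloc_on_node(int nid, size_t count)
{
	/* first choice: the area that lives on the caller's node */
	if (nid >= 0 && nid < MAX_NODES &&
	    area_alloc(pernuma_area[nid], count))
		return pernuma_area[nid];

	/* fallback: the global default area */
	if (area_alloc(&default_area, count))
		return &default_area;

	return NULL;
}
```

In this model a node whose pernuma_area[] slot is NULL behaves exactly as
if the option had never been given.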

> 
> > >
> > > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index
> > > > 847a9d1fa634..db7a37ed35eb 100644
> > > > --- a/kernel/dma/Kconfig
> > > > +++ b/kernel/dma/Kconfig
> > > > @@ -118,6 +118,16 @@ config DMA_CMA
> > > >   If unsure, say "n".
> > > >
> > > >  if  DMA_CMA
> > > > +
> > > > +config DMA_PERNUMA_CMA
> > > > +   bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
> > >
> > > I don't understand the need for this config option. If you have
> > > DMA_CMA and you have NUMA, why wouldn't you want this enabled?
> >
> > Christoph preferred this in previous patchset in order to be able to
> > remove all of the code in the text if users don't use pernuma CMA.
> 
> Ok, I defer to Christoph here, but maybe a "default NUMA" might work?

maybe "default NUMA && ARM64"?
Though I believe it will benefit x86, I don't have x86 server hardware
or a real scenario to test. So I haven't put the dma_pernuma_cma_reserve()
code in arch/x86.
Hopefully some x86 guys will bring it up and remove the "&& ARM64".

> 
> > > > +   help
> > > > + Enable this option to get pernuma CMA areas so that devices like
> > > > + ARM64 SMMU can get local memory by DMA coherent APIs.
> > > > +
> > > > + You can set the size of pernuma CMA by specifying

RE: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Will Deacon [mailto:w...@kernel.org]
> Sent: Friday, August 21, 2020 8:47 PM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> iommu@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> huangdaode ; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On Fri, Aug 21, 2020 at 02:26:14PM +1200, Barry Song wrote:
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..3f33b89aeab5 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,15 @@
> > altogether. For more information, see
> > include/linux/dma-contiguous.h
> >
> > +   pernuma_cma=nn[MG]
> > +   [ARM64,KNL]
> > +   Sets the size of kernel per-numa memory area for
> > +   contiguous memory allocations. A value of 0 disables
> > +   per-numa CMA altogether. DMA users on node nid will
> > +   first try to allocate buffer from the pernuma area
> > +   which is located in node nid, if the allocation fails,
> > +   they will fallback to the global default memory area.
> 
> What is the default behaviour if this option is not specified? Seems like
> that should be mentioned here.
> 
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index 847a9d1fa634..db7a37ed35eb 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -118,6 +118,16 @@ config DMA_CMA
> >   If unsure, say "n".
> >
> >  if  DMA_CMA
> > +
> > +config DMA_PERNUMA_CMA
> > +   bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
> 
> I don't understand the need for this config option. If you have DMA_CMA and
> you have NUMA, why wouldn't you want this enabled?

Christoph preferred this in the previous patchset in order to be able to remove
all of the code in the text if users don't use pernuma CMA.

> 
> > +   help
> > + Enable this option to get pernuma CMA areas so that devices like
> > + ARM64 SMMU can get local memory by DMA coherent APIs.
> > +
> > + You can set the size of pernuma CMA by specifying "pernuma_cma=size"
> > + on the kernel's command line.
> > +
> >  comment "Default contiguous memory area size:"
> >
> >  config CMA_SIZE_MBYTES
> > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> > index cff7e60968b9..89b95f10e56d 100644
> > --- a/kernel/dma/contiguous.c
> > +++ b/kernel/dma/contiguous.c
> > @@ -69,6 +69,19 @@ static int __init early_cma(char *p)
> >  }
> >  early_param("cma", early_cma);
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +
> > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> > +static phys_addr_t pernuma_size_bytes __initdata;
> > +
> > +static int __init early_pernuma_cma(char *p)
> > +{
> > +   pernuma_size_bytes = memparse(p, &p);
> > +   return 0;
> > +}
> > +early_param("pernuma_cma", early_pernuma_cma);
> > +#endif
> > +
> >  #ifdef CONFIG_CMA_SIZE_PERCENTAGE
> >
> >  static phys_addr_t __init __maybe_unused
> cma_early_percent_memory(void)
> > @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t
> cma_early_percent_memory(void)
> >
> >  #endif
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +void __init dma_pernuma_cma_reserve(void)
> > +{
> > +   int nid;
> > +
> > +   if (!pernuma_size_bytes)
> > +   return;
> 
> If this is useful (I assume it is), then I think we should have a non-zero
> default value, a bit like normal CMA does via CMA_SIZE_MBYTES.

The patchset used to have a CONFIG_PERNUMA_CMA_SIZE in kernel/dma/Kconfig, but
Christoph was not comfortable with it:
https://lore.kernel.org/linux-iommu/20200728115231.ga...@lst.de/

Would you mind hardcoding the value in CONFIG_CMDLINE in arch/arm64/Kconfig, as
Christoph mentioned:
config CMDLINE
default "pernuma_cma=16M"

If you also don't like the change in arch/arm64/Kconfig CMDLINE, I guess I have
to depend on users' setting in cmdline just

RE: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Randy Dunlap
> Sent: Friday, August 21, 2020 2:50 PM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com
> Cc: iommu@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> huangdaode ; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On 8/20/20 7:26 PM, Barry Song wrote:
> >
> >
> > Cc: Jonathan Cameron 
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Will Deacon 
> > Cc: Robin Murphy 
> > Cc: Ganapatrao Kulkarni 
> > Cc: Catalin Marinas 
> > Cc: Nicolas Saenz Julienne 
> > Cc: Steve Capper 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Signed-off-by: Barry Song 
> > ---
> >  v6: rebase on top of 5.9-rc1;
> >  doc cleanup
> >
> >  .../admin-guide/kernel-parameters.txt |   9 ++
> >  include/linux/dma-contiguous.h|   6 ++
> >  kernel/dma/Kconfig|  10 ++
> >  kernel/dma/contiguous.c   | 100
> --
> >  4 files changed, 115 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..3f33b89aeab5 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,15 @@
> > altogether. For more information, see
> > include/linux/dma-contiguous.h
> >
> > +   pernuma_cma=nn[MG]
> 
> memparse() allows any one of these suffixes: K, M, G, T, P, E
> and nothing in the option parsing function cares what suffix is used...

Hello Randy,
Thanks for your comments.

Actually I am following the suffix of default cma:
cma=nn[MG]@[start[MG][-end[MG]]]
[ARM,X86,KNL]
Sets the size of kernel global memory area for
contiguous memory allocations and optionally the
placement constraint by the physical address range of
memory allocations. A value of 0 disables CMA
altogether. For more information, see
include/linux/dma-contiguous.h

I suggest users should set the size in either MB or GB as they set cma. 
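
For reference, the suffix handling Randy describes can be sketched in
userspace as below. This is a hedged illustration, not the kernel's
memparse(): parse_size() is a hypothetical name, and the sketch assumes
each suffix step (K, M, G, T, P, E) scales the value by 1024 and that the
end pointer is left just past the suffix.

```c
#include <stdlib.h>

/*
 * Hypothetical userspace model of memparse()-style parsing.
 * Parses a number, applies an optional K/M/G/T/P/E suffix, and stores
 * the position after the parsed text in *end when end is not NULL.
 */
unsigned long long parse_size(const char *s, char **end)
{
	char *p;
	unsigned long long v = strtoull(s, &p, 0);

	switch (*p) {
	case 'E': case 'e':
		v <<= 10;
		/* fall through */
	case 'P': case 'p':
		v <<= 10;
		/* fall through */
	case 'T': case 't':
		v <<= 10;
		/* fall through */
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		p++;	/* consume the suffix character */
		break;
	default:
		break;
	}

	if (end)
		*end = p;
	return v;
}
```

So a value such as "16M" or "1G" parses the same way whichever suffix is
documented; the [MG] in the parameter description is a recommendation
rather than something the parser enforces.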

> 
> > +   [ARM64,KNL]
> > +   Sets the size of kernel per-numa memory area for
> > +   contiguous memory allocations. A value of 0 disables
> > +   per-numa CMA altogether. DMA users on node nid will
> > +   first try to allocate buffer from the pernuma area
> > +   which is located in node nid, if the allocation fails,
> > +   they will fallback to the global default memory area.
> > +
> > cmo_free_hint=  [PPC] Format: { yes | no }
> > Specify whether pages are marked as being inactive
> > when they are freed.  This is used in CMO environments
> 
> > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> > index cff7e60968b9..89b95f10e56d 100644
> > --- a/kernel/dma/contiguous.c
> > +++ b/kernel/dma/contiguous.c
> > @@ -69,6 +69,19 @@ static int __init early_cma(char *p)
> >  }
> >  early_param("cma", early_cma);
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +
> > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> > +static phys_addr_t pernuma_size_bytes __initdata;
> 
> why phys_addr_t? couldn't it just be unsigned long long?
> 

Mainly because I am following the programming habit in kernel/dma/contiguous.c.
I think the original code probably meant that the size should not be larger than
the MAXIMUM value of phys_addr_t:

/*
 * Default global CMA area size can be defined in kernel's .config.
 * This is useful mainly for distro maintainers to create a kernel
 * that works correctly for most supported systems.
 * The size can be set in bytes or as a percentage of the total memory
 * in the system.
 *
 * Users, who want to set the size of global CMA area for their system
 * should use cma= kernel parameter.
 */
static const phys_addr_t size_bytes __initconst =
(phys_addr_t)CMA_SIZE_MBYT

[PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-20 Thread Barry Song
Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get
coherent DMA buffers to save their command queues and page tables. As
there is only one default CMA in the whole system, SMMUs on nodes other
than node0 will get remote memory. This leads to significant latency.

This patch provides per-numa CMA so that drivers like SMMU can get local
memory. Tests show that localizing CMA can decrease dma_unmap latency
considerably. For instance, before this patch, an SMMU on node2 has to wait
more than 560ns for the completion of CMD_SYNC in an empty command queue;
with this patch, it needs only 240ns.

A positive side effect of this patch would be improving performance even
further for those users who are worried about performance more than DMA
security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all
drivers can get local coherent DMA buffers.

Cc: Jonathan Cameron 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Ganapatrao Kulkarni 
Cc: Catalin Marinas 
Cc: Nicolas Saenz Julienne 
Cc: Steve Capper 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Signed-off-by: Barry Song 
---
 v6: rebase on top of 5.9-rc1;
 doc cleanup

 .../admin-guide/kernel-parameters.txt |   9 ++
 include/linux/dma-contiguous.h|   6 ++
 kernel/dma/Kconfig|  10 ++
 kernel/dma/contiguous.c   | 100 --
 4 files changed, 115 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index bdc1f33fd3d1..3f33b89aeab5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -599,6 +599,15 @@
altogether. For more information, see
include/linux/dma-contiguous.h
 
+   pernuma_cma=nn[MG]
+   [ARM64,KNL]
+   Sets the size of kernel per-numa memory area for
+   contiguous memory allocations. A value of 0 disables
+   per-numa CMA altogether. DMA users on node nid will
+   first try to allocate buffer from the pernuma area
+   which is located in node nid, if the allocation fails,
+   they will fallback to the global default memory area.
+
cmo_free_hint=  [PPC] Format: { yes | no }
Specify whether pages are marked as being inactive
when they are freed.  This is used in CMO environments
diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
index 03f8e98e3bcc..fe55e004f1f4 100644
--- a/include/linux/dma-contiguous.h
+++ b/include/linux/dma-contiguous.h
@@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, 
struct page *page,
 
 #endif
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+void dma_pernuma_cma_reserve(void);
+#else
+static inline void dma_pernuma_cma_reserve(void) { }
+#endif
+
 #endif
 
 #endif
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 847a9d1fa634..db7a37ed35eb 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -118,6 +118,16 @@ config DMA_CMA
  If unsure, say "n".
 
 if  DMA_CMA
+
+config DMA_PERNUMA_CMA
+   bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
+   help
+ Enable this option to get pernuma CMA areas so that devices like
+ ARM64 SMMU can get local memory by DMA coherent APIs.
+
+ You can set the size of pernuma CMA by specifying "pernuma_cma=size"
+ on the kernel's command line.
+
 comment "Default contiguous memory area size:"
 
 config CMA_SIZE_MBYTES
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index cff7e60968b9..89b95f10e56d 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -69,6 +69,19 @@ static int __init early_cma(char *p)
 }
 early_param("cma", early_cma);
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+
+static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
+static phys_addr_t pernuma_size_bytes __initdata;
+
+static int __init early_pernuma_cma(char *p)
+{
+   pernuma_size_bytes = memparse(p, &p);
+   return 0;
+}
+early_param("pernuma_cma", early_pernuma_cma);
+#endif
+
 #ifdef CONFIG_CMA_SIZE_PERCENTAGE
 
 static phys_addr_t __init __maybe_unused cma_early_percent_memory(void)
@@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t 
cma_early_percent_memory(void)
 
 #endif
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+void __init dma_pernuma_cma_reserve(void)
+{
+   int nid;
+
+   if (!pernuma_size_bytes)
+   return;
+
+   for_each_node_state(nid, N_ONLINE) {
+   int ret;
+   char name[20];
+   struct cma **cma = &dma_contiguous_pernuma_area[nid];
+
+   snprintf(name, sizeof(name), "pernuma%d", nid);
+ 

[PATCH v6 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-20 Thread Barry Song
Ganapatrao Kulkarni has put some effort into making arm-smmu-v3 use local
memory to save command queues[1]. I also did a similar job in the patch
"iommu/arm-smmu-v3: allocate the memory of queues in local numa node"
[2], not realizing Ganapatrao had done that before.

But it seems much better to make dma_alloc_coherent() inherently
NUMA-aware on NUMA-capable systems.

Right now, smmu is using dma_alloc_coherent() to get memory to save queues
and tables. Typically, on an ARM64 server, there is a default CMA located at
node0, which could be far away from node2, node3, etc.
Saving queues and tables remotely will increase the latency of ARM SMMU
significantly. For example, when the SMMU is at node2 and the default global
CMA is at node0, after sending a CMD_SYNC in an empty command queue, we
have to wait more than 550ns for the completion of the command CMD_SYNC.
However, if we save them locally, we only need to wait for 240ns.

With per-numa CMA, smmu will get memory from the local numa node to save
command queues and page tables. That means dma_unmap latency will shrink
considerably.

Meanwhile, when iommu.passthrough is on, device drivers which call
dma_alloc_coherent() will also get local memory and avoid the travel between
numa nodes.

I only have ARM64 server platforms to test, but I believe this patch will
benefit X86 somehow. Hopefully, some X86 guys will bring it up on x86.

[1] https://lists.linuxfoundation.org/pipermail/iommu/2017-October/024455.html
[2] https://www.spinics.net/lists/iommu/msg44767.html


-v6:
 * rebase on top of 5.9-rc1
 * doc cleanup

-v5:
 refine code according to Christoph Hellwig's comments
 * remove Kconfig option for pernuma cma size;
 * add Kconfig option for pernuma cma enable;
 * code cleanup like line over 80 char

 I haven't removed the cma NULL check code in cma_alloc() as it requires
 a bundle of other changes. So I prefer to handle this issue separately.

-v4:
 * rebase on top of Christoph Hellwig's patch:
 [PATCH v2] dma-contiguous: cleanup dma_alloc_contiguous
 https://lore.kernel.org/linux-iommu/20200723120133.94105-1-...@lst.de/
 * cleanup according to Christoph's comment
 * rebase on top of linux-next to avoid arch/arm64 conflicts
 * reserve cma by checking N_MEMORY rather than N_ONLINE

-v3:
  * move to use page_to_nid() while freeing cma with respect to Robin's
  comment, but this will only work after applying my below patch:
  "mm/cma.c: use exact_nid true to fix possible per-numa cma leak"
  https://marc.info/?l=linux-mm=159333034726647=2

  * handle the case count <= 1 more properly according to Robin's
  comment;

  * add pernuma_cma parameter to support dynamic setting of per-numa
  cma size;
  ideally we can leverage the CMA_SIZE_MBYTES, CMA_SIZE_PERCENTAGE and
  "cma=" kernel parameter and avoid a new parameter separately for
  per-numa cma. Practically, it is really too complicated considering the
  below problems:
  (1) if we leverage the size of default numa for per-numa, we have to
  avoid creating two cma with same size in node0 since default cma is
  probably on node0.
  (2) default cma can consider the address limitation for old devices
  while per-numa cma doesn't support GFP_DMA and GFP_DMA32. all
  allocations with limitation flags will fallback to default one.
  (3) hard to apply CMA_SIZE_PERCENTAGE to per-numa. it is hard to
  decide if the percentage should apply to the whole memory size
  or only apply to the memory size of a specific numa node.
  (4) default cma size has CMA_SIZE_SEL_MIN and CMA_SIZE_SEL_MAX, it
  makes things even more complicated to per-numa cma.

  I haven't figured out a good way to leverage the size of default cma
  for per-numa cma. it seems a separate parameter for per-numa could
  make life easier.

  * move dma_pernuma_cma_reserve() after hugetlb_cma_reserve() to
  reuse the comment before hugetlb_cma_reserve() with respect to
  Robin's comment

-v2: 
  * fix some issues reported by kernel test robot
  * fallback to default cma while allocation fails in per-numa cma
 free memory properly

Barry Song (2):
  dma-contiguous: provide the ability to reserve per-numa CMA
  arm64: mm: reserve per-numa CMA to localize coherent dma buffers

 .../admin-guide/kernel-parameters.txt |   9 ++
 arch/arm64/mm/init.c  |   2 +
 include/linux/dma-contiguous.h|   6 ++
 kernel/dma/Kconfig|  10 ++
 kernel/dma/contiguous.c   | 100 --
 5 files changed, 117 insertions(+), 10 deletions(-)

-- 
2.27.0




[PATCH v6 2/2] arm64: mm: reserve per-numa CMA to localize coherent dma buffers

2020-08-20 Thread Barry Song
Right now, smmu is using dma_alloc_coherent() to get memory to save queues
and tables. Typically, on an ARM64 server, there is a default CMA located at
node0, which could be far away from node2, node3, etc.
With this patch, smmu will get memory from the local numa node to save command
queues and page tables. That means dma_unmap latency will shrink considerably.
Meanwhile, when iommu.passthrough is on, device drivers which call
dma_alloc_coherent() will also get local memory and avoid the travel between
numa nodes.

Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Ganapatrao Kulkarni 
Cc: Catalin Marinas 
Cc: Nicolas Saenz Julienne 
Cc: Steve Capper 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Signed-off-by: Barry Song 
---
 -v6: rebase on top of 5.9-rc1

 arch/arm64/mm/init.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 481d22c32a2e..f1c75957ff3c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -429,6 +429,8 @@ void __init bootmem_init(void)
arm64_hugetlb_cma_reserve();
 #endif
 
+   dma_pernuma_cma_reserve();
+
/*
 * sparse_init() tries to allocate memory from memblock, so must be
 * done after the fixed reservations
-- 
2.27.0




RE: [PATCH v4] iommu/arm-smmu-v3: permit users to disable msi polling

2020-08-18 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Wednesday, August 19, 2020 2:31 AM
> To: Song Bao Hua (Barry Song) ; w...@kernel.org;
> j...@8bytes.org
> Cc: Zengtao (B) ;
> iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> Linuxarm 
> Subject: Re: [PATCH v4] iommu/arm-smmu-v3: permit users to disable msi
> polling
> 
> On 2020-08-18 12:17, Barry Song wrote:
> > Polling by MSI isn't necessarily faster than polling by SEV. Tests on
> > hi1620 show hns3 100G NIC network throughput can improve from 25G to
> > 27G if we disable MSI polling while running 16 netperf threads sending
> > UDP packets in size 32KB. TX throughput can improve from 7G to 7.7G for
> > single thread.
> > The reason for the throughput improvement is that the latency to poll
> > the completion of CMD_SYNC becomes smaller. After sending a CMD_SYNC
> > in an empty cmd queue, typically we need to wait for 280ns using MSI
> > polling. But we only need around 190ns after disabling MSI polling.
> > This patch provides a command line option so that users can decide to
> > use MSI polling or not based on their tests.
> >
> > Signed-off-by: Barry Song 
> > ---
> >   -v4: rebase on top of 5.9-rc1
> >   refine changelog
> >
> >   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18
> ++
> >   1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 7196207be7ea..89d3cb391fef 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -418,6 +418,11 @@ module_param_named(disable_bypass,
> disable_bypass, bool, S_IRUGO);
> >   MODULE_PARM_DESC(disable_bypass,
> > "Disable bypass streams such that incoming transactions from devices
> that are not attached to an iommu domain will report an abort back to the
> device and will not be allowed to pass through the SMMU.");
> >
> > +static bool disable_msipolling;
> > +module_param_named(disable_msipolling, disable_msipolling, bool,
> S_IRUGO);
> 
> Just use module_param() - going out of the way to specify a "different"
> name that's identical to the variable name is silly.

Thanks for pointing that out. I have also fixed the same issue in the existing
parameter disable_bypass in the new patchset.

I am sorry, but I made a typo: the new patchset should be v5, but I wrote v4.

> 
> Also I think the preference these days is to specify permissions as
> plain octal constants rather than those rather inscrutable macros. I
> certainly find that more readable myself.
> 
> (Yes, the existing parameter commits the same offences, but I'd rather
> clean that up separately than perpetuate it)

Thanks for pointing that out. It is fixed in the new patchset.

> 
> > +MODULE_PARM_DESC(disable_msipolling,
> > +   "Disable MSI-based polling for CMD_SYNC completion.");
> > +
> >   enum pri_resp {
> > PRI_RESP_DENY = 0,
> > PRI_RESP_FAIL = 1,
> > @@ -980,6 +985,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd,
> struct arm_smmu_cmdq_ent *ent)
> > return 0;
> >   }
> >
> > +static bool arm_smmu_use_msipolling(struct arm_smmu_device *smmu)
> > +{
> > +   return !disable_msipolling &&
> > +  smmu->features & ARM_SMMU_FEAT_COHERENCY &&
> > +  smmu->features & ARM_SMMU_FEAT_MSI;
> > +}
> 
> I'd wrap this up into a new ARM_SMMU_OPT_MSIPOLL flag set at probe time,
> rather than constantly reevaluating this whole expression (now that it's
> no longer simply testing two adjacent bits of the same word).

Done in the new patchset. It turns out we only need to check one bit now
with the new patch:

-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY)
+   if (smmu->options & ARM_SMMU_OPT_MSIPOLL)
return __arm_smmu_cmdq_poll_until_msi(smmu, llq);
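
The effect of this change can be illustrated with a small user-space model.
This is only a sketch: struct ex_smmu and the EX_* bits are stand-ins invented
for illustration, not the driver's definitions. The point is that the three
conditions (MSI support, coherency, and the command line switch) are folded
into one option bit at probe time, so the polling path tests a single bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's feature and option bits. */
#define EX_FEAT_COHERENCY	(1u << 0)
#define EX_FEAT_MSI		(1u << 1)
#define EX_OPT_MSIPOLL		(1u << 2)

struct ex_smmu {
	uint32_t features;
	uint32_t options;
};

/* Evaluate the whole condition once, when the device is probed. */
static void ex_probe(struct ex_smmu *smmu, bool hw_has_msi, bool coherent,
		     bool disable_msipolling)
{
	assert(smmu != NULL);

	smmu->features = 0;
	smmu->options = 0;

	if (coherent)
		smmu->features |= EX_FEAT_COHERENCY;

	if (hw_has_msi) {
		smmu->features |= EX_FEAT_MSI;
		if (coherent && !disable_msipolling)
			smmu->options |= EX_OPT_MSIPOLL;
	}
}

/* The hot path then checks one bit instead of three conditions. */
static bool ex_use_msipoll(const struct ex_smmu *smmu)
{
	assert(smmu != NULL);
	return (smmu->options & EX_OPT_MSIPOLL) != 0;
}
```

Calling ex_probe() with hw_has_msi and coherent both true and
disable_msipolling false sets the option bit; changing any one of the three
inputs leaves it clear.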


> 
> Robin.
> 
> > +
> >   static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct
> arm_smmu_device *smmu,
> >  u32 prod)
> >   {
> > @@ -992,8 +1004,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64
> *cmd, struct arm_smmu_device *smmu,
> >  * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
> >  * payload, so the write will zero the entire command on that platform.
> >  */
> > -   if (smmu->features & ARM_SMMU_FEAT_MSI &&
> > -   smmu-

[PATCH v4 1/3] iommu/arm-smmu-v3: replace symbolic permissions by octal permissions for module parameter

2020-08-18 Thread Barry Song
This fixes the checkpatch issue below:
WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using
octal permissions '0444'.
417: FILE: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:417:
module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);
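
For reference, the equivalence that checkpatch relies on can be checked in
user space. This is a sketch under an assumption: user-space headers do not
provide S_IRUGO, so EXAMPLE_S_IRUGO below is a local stand-in built from the
POSIX read bits, which is how the symbolic name is commonly understood.

```c
#include <assert.h>
#include <sys/stat.h>

/* Local stand-in for S_IRUGO: read permission for user, group and others. */
#define EXAMPLE_S_IRUGO	(S_IRUSR | S_IRGRP | S_IROTH)

/* Compile-time check that the symbolic form equals the octal constant. */
static_assert(EXAMPLE_S_IRUGO == 0444, "S_IRUGO stand-in must equal 0444");

/* Returns the numeric mode that the symbolic form expands to. */
static unsigned int example_irugo_mode(void)
{
	return (unsigned int)EXAMPLE_S_IRUGO;
}
```

The octal form states the resulting mode directly, which is why it is
considered easier to read than the macro.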

-v4:
   * cleanup the existing module parameter of bypass_
   * add ARM_SMMU_OPT_MSIPOLL flag with respect to Robin's comments

Signed-off-by: Barry Song 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207be7ea..eea5f7c6d9ab 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -414,7 +414,7 @@
 #define MSI_IOVA_LENGTH0x10
 
 static bool disable_bypass = 1;
-module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);
+module_param_named(disable_bypass, disable_bypass, bool, 0444);
 MODULE_PARM_DESC(disable_bypass,
"Disable bypass streams such that incoming transactions from devices 
that are not attached to an iommu domain will report an abort back to the 
device and will not be allowed to pass through the SMMU.");
 
-- 
2.27.0




[PATCH v4 3/3] iommu/arm-smmu-v3: permit users to disable msi polling

2020-08-18 Thread Barry Song
Polling by MSI isn't necessarily faster than polling by SEV. Tests on
hi1620 show that hns3 100G NIC network throughput improves from 25G to
27G if we disable MSI polling while running 16 netperf threads sending
32KB UDP packets. TX throughput improves from 7G to 7.7G for a
single thread.
The reason for the throughput improvement is that the latency of polling
for CMD_SYNC completion becomes smaller. After sending a CMD_SYNC
into an empty command queue, we typically need to wait 280ns with MSI
polling, but only around 190ns after disabling MSI polling.
This patch provides a command line option so that users can decide whether
to use MSI polling based on their own tests.

Signed-off-by: Barry Song 
---
 -v4: add ARM_SMMU_OPT_MSIPOLL flag with respect to Robin's comment

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5b40d535a7c8..7332251dd8cd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -418,6 +418,11 @@ module_param(disable_bypass, bool, 0444);
 MODULE_PARM_DESC(disable_bypass,
"Disable bypass streams such that incoming transactions from devices 
that are not attached to an iommu domain will report an abort back to the 
device and will not be allowed to pass through the SMMU.");
 
+static bool disable_msipolling;
+module_param(disable_msipolling, bool, 0444);
+MODULE_PARM_DESC(disable_msipolling,
+   "Disable MSI-based polling for CMD_SYNC completion.");
+
 enum pri_resp {
PRI_RESP_DENY = 0,
PRI_RESP_FAIL = 1,
@@ -652,6 +657,7 @@ struct arm_smmu_device {
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
 #define ARM_SMMU_OPT_PAGE0_REGS_ONLY   (1 << 1)
+#define ARM_SMMU_OPT_MSIPOLL   (1 << 2)
u32 options;
 
struct arm_smmu_cmdqcmdq;
@@ -992,8 +998,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct 
arm_smmu_device *smmu,
 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
 * payload, so the write will zero the entire command on that platform.
 */
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY) {
+   if (smmu->options & ARM_SMMU_OPT_MSIPOLL) {
ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
   q->ent_dwords * 8;
}
@@ -1332,8 +1337,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct 
arm_smmu_device *smmu,
 static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
 struct arm_smmu_ll_queue *llq)
 {
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY)
+   if (smmu->options & ARM_SMMU_OPT_MSIPOLL)
return __arm_smmu_cmdq_poll_until_msi(smmu, llq);
 
return __arm_smmu_cmdq_poll_until_consumed(smmu, llq);
@@ -3741,8 +3745,11 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
if (reg & IDR0_SEV)
smmu->features |= ARM_SMMU_FEAT_SEV;
 
-   if (reg & IDR0_MSI)
+   if (reg & IDR0_MSI) {
smmu->features |= ARM_SMMU_FEAT_MSI;
+   if (coherent && !disable_msipolling)
+   smmu->options |= ARM_SMMU_OPT_MSIPOLL;
+   }
 
if (reg & IDR0_HYP)
smmu->features |= ARM_SMMU_FEAT_HYP;
-- 
2.27.0




[PATCH v4 0/3] iommu/arm-smmu-v3: permit users to disable msi polling

2020-08-18 Thread Barry Song
Patches 1/3 and 2/3 prepare for patch 3/3, which permits users to disable
MSI-based polling from the command line.

-v4:
  with respect to Robin's comments
  * cleanup the code of the existing module parameter disable_bypass
  * add ARM_SMMU_OPT_MSIPOLL flag. on the other hand, we only need to check
a bit in options rather than two bits in features

Barry Song (3):
  iommu/arm-smmu-v3: replace symbolic permissions by octal permissions
for module parameter
  iommu/arm-smmu-v3: replace module_param_named by module_param for
disable_bypass
  iommu/arm-smmu-v3: permit users to disable msi polling

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

-- 
2.27.0




[PATCH v4 2/3] iommu/arm-smmu-v3: replace module_param_named by module_param for disable_bypass

2020-08-18 Thread Barry Song
Just use module_param() - going out of the way to specify a "different"
name that's identical to the variable name is silly.

Signed-off-by: Barry Song 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index eea5f7c6d9ab..5b40d535a7c8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -414,7 +414,7 @@
 #define MSI_IOVA_LENGTH0x10
 
 static bool disable_bypass = 1;
-module_param_named(disable_bypass, disable_bypass, bool, 0444);
+module_param(disable_bypass, bool, 0444);
 MODULE_PARM_DESC(disable_bypass,
"Disable bypass streams such that incoming transactions from devices 
that are not attached to an iommu domain will report an abort back to the 
device and will not be allowed to pass through the SMMU.");
 
-- 
2.27.0




[PATCH v4] iommu/arm-smmu-v3: permit users to disable msi polling

2020-08-18 Thread Barry Song
Polling by MSI isn't necessarily faster than polling by SEV. Tests on
hi1620 show that hns3 100G NIC network throughput improves from 25G to
27G if we disable MSI polling while running 16 netperf threads sending
32KB UDP packets. TX throughput improves from 7G to 7.7G for a
single thread.
The reason for the throughput improvement is that the latency of polling
for CMD_SYNC completion becomes smaller. After sending a CMD_SYNC
into an empty command queue, we typically need to wait 280ns with MSI
polling, but only around 190ns after disabling MSI polling.
This patch provides a command line option so that users can decide whether
to use MSI polling based on their own tests.

Signed-off-by: Barry Song 
---
 -v4: rebase on top of 5.9-rc1
 refine changelog

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207be7ea..89d3cb391fef 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -418,6 +418,11 @@ module_param_named(disable_bypass, disable_bypass, bool, 
S_IRUGO);
 MODULE_PARM_DESC(disable_bypass,
"Disable bypass streams such that incoming transactions from devices 
that are not attached to an iommu domain will report an abort back to the 
device and will not be allowed to pass through the SMMU.");
 
+static bool disable_msipolling;
+module_param_named(disable_msipolling, disable_msipolling, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_msipolling,
+   "Disable MSI-based polling for CMD_SYNC completion.");
+
 enum pri_resp {
PRI_RESP_DENY = 0,
PRI_RESP_FAIL = 1,
@@ -980,6 +985,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct 
arm_smmu_cmdq_ent *ent)
return 0;
 }
 
+static bool arm_smmu_use_msipolling(struct arm_smmu_device *smmu)
+{
+   return !disable_msipolling &&
+  smmu->features & ARM_SMMU_FEAT_COHERENCY &&
+  smmu->features & ARM_SMMU_FEAT_MSI;
+}
+
 static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device 
*smmu,
 u32 prod)
 {
@@ -992,8 +1004,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct 
arm_smmu_device *smmu,
 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
 * payload, so the write will zero the entire command on that platform.
 */
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY) {
+   if (arm_smmu_use_msipolling(smmu)) {
ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
   q->ent_dwords * 8;
}
@@ -1332,8 +1343,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct 
arm_smmu_device *smmu,
 static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
 struct arm_smmu_ll_queue *llq)
 {
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY)
+   if (arm_smmu_use_msipolling(smmu))
return __arm_smmu_cmdq_poll_until_msi(smmu, llq);
 
return __arm_smmu_cmdq_poll_until_consumed(smmu, llq);
-- 
2.27.0




RE: [PATCH v3] iommu/arm-smmu-v3: permit users to disable MSI polling

2020-08-03 Thread Song Bao Hua (Barry Song)
> -Original Message-
> From: John Garry
> Sent: Tuesday, August 4, 2020 3:34 AM
> To: Song Bao Hua (Barry Song) ; w...@kernel.org;
> robin.mur...@arm.com; j...@8bytes.org; iommu@lists.linux-foundation.org
> Cc: Zengtao (B) ;
> linux-arm-ker...@lists.infradead.org
> Subject: Re: [PATCH v3] iommu/arm-smmu-v3: permit users to disable MSI
> polling
> 
> On 01/08/2020 08:47, Barry Song wrote:
> > Polling by MSI isn't necessarily faster than polling by SEV. Tests on
> > hi1620 show hns3 100G NIC network throughput can improve from 25G to
> > 27G if we disable MSI polling while running 16 netperf threads sending
> > UDP packets in size 32KB.
> 
> BTW, Do we have any more results than this? This is just one scenario.
> 

John, it is more than one scenario. A micro-benchmark shows that polling by SEV
has lower latency than polling by MSI. This motivated me to verify it with a
real scenario. For this network case, if we set the thread count to 1 rather
than 16, network TX throughput improves from 7Gbps to 7.7Gbps.

> How about your micro-benchmark, which allows you to set the number of
> CPUs?

The micro-benchmark works like this:
Send a CMD_SYNC into an empty command queue.
Poll for the completion of this CMD_SYNC by MSI or by SEV.

I have seen the polling latency decrease by about 80ns. Without this patch,
the latency was about 270ns; with this patch, it is about 190ns.

> 
> Thanks,
> John
> 
> > This patch provides a command line option so that users can decide to
> > use MSI polling or not based on their tests.
> >
> > Signed-off-by: Barry Song 
> > ---
> >   -v3:
> >* rebase on top of linux-next as arm-smmu-v3.c has moved;
> >* provide a command line option
> >
> >   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18
> ++
> >   1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 7196207be7ea..89d3cb391fef 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -418,6 +418,11 @@ module_param_named(disable_bypass,
> disable_bypass, bool, S_IRUGO);
> >   MODULE_PARM_DESC(disable_bypass,
> > "Disable bypass streams such that incoming transactions from devices
> that are not attached to an iommu domain will report an abort back to the
> device and will not be allowed to pass through the SMMU.");
> >
> > +static bool disable_msipolling;
> > +module_param_named(disable_msipolling, disable_msipolling, bool,
> S_IRUGO);
> > +MODULE_PARM_DESC(disable_msipolling,
> > +   "Disable MSI-based polling for CMD_SYNC completion.");
> > +
> >   enum pri_resp {
> > PRI_RESP_DENY = 0,
> > PRI_RESP_FAIL = 1,
> > @@ -980,6 +985,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd,
> struct arm_smmu_cmdq_ent *ent)
> > return 0;
> >   }
> >
> > +static bool arm_smmu_use_msipolling(struct arm_smmu_device *smmu)
> > +{
> > +   return !disable_msipolling &&
> > +  smmu->features & ARM_SMMU_FEAT_COHERENCY &&
> > +  smmu->features & ARM_SMMU_FEAT_MSI;
> > +}
> > +
> >   static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct
> arm_smmu_device *smmu,
> >  u32 prod)
> >   {
> > @@ -992,8 +1004,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64
> *cmd, struct arm_smmu_device *smmu,
> >  * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
> >  * payload, so the write will zero the entire command on that platform.
> >  */
> > -   if (smmu->features & ARM_SMMU_FEAT_MSI &&
> > -   smmu->features & ARM_SMMU_FEAT_COHERENCY) {
> > +   if (arm_smmu_use_msipolling(smmu)) {
> > ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
> >q->ent_dwords * 8;
> > }
> > @@ -1332,8 +1343,7 @@ static int
> __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
> >   static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device
> *smmu,
> >  struct arm_smmu_ll_queue *llq)
> >   {
> > -   if (smmu->features & ARM_SMMU_FEAT_MSI &&
> > -   smmu->features & ARM_SMMU_FEAT_COHERENCY)
> > +   if (arm_smmu_use_msipolling(smmu))
> > return __arm_smmu_cmdq_poll_until_msi(smmu, llq);
> >
> > return __arm_smmu_cmdq_poll_until_consumed(smmu, llq);
> >

Thanks
Barry



[PATCH v3] iommu/arm-smmu-v3: permit users to disable MSI polling

2020-08-01 Thread Barry Song
Polling by MSI isn't necessarily faster than polling by SEV. Tests on
hi1620 show that hns3 100G NIC network throughput improves from 25G to
27G if we disable MSI polling while running 16 netperf threads sending
32KB UDP packets.
This patch provides a command line option so that users can decide whether
to use MSI polling based on their own tests.

Signed-off-by: Barry Song 
---
 -v3:
  * rebase on top of linux-next as arm-smmu-v3.c has moved;
  * provide a command line option

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207be7ea..89d3cb391fef 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -418,6 +418,11 @@ module_param_named(disable_bypass, disable_bypass, bool, 
S_IRUGO);
 MODULE_PARM_DESC(disable_bypass,
"Disable bypass streams such that incoming transactions from devices 
that are not attached to an iommu domain will report an abort back to the 
device and will not be allowed to pass through the SMMU.");
 
+static bool disable_msipolling;
+module_param_named(disable_msipolling, disable_msipolling, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_msipolling,
+   "Disable MSI-based polling for CMD_SYNC completion.");
+
 enum pri_resp {
PRI_RESP_DENY = 0,
PRI_RESP_FAIL = 1,
@@ -980,6 +985,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct 
arm_smmu_cmdq_ent *ent)
return 0;
 }
 
+static bool arm_smmu_use_msipolling(struct arm_smmu_device *smmu)
+{
+   return !disable_msipolling &&
+  smmu->features & ARM_SMMU_FEAT_COHERENCY &&
+  smmu->features & ARM_SMMU_FEAT_MSI;
+}
+
 static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device 
*smmu,
 u32 prod)
 {
@@ -992,8 +1004,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct 
arm_smmu_device *smmu,
 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
 * payload, so the write will zero the entire command on that platform.
 */
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY) {
+   if (arm_smmu_use_msipolling(smmu)) {
ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
   q->ent_dwords * 8;
}
@@ -1332,8 +1343,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct 
arm_smmu_device *smmu,
 static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
 struct arm_smmu_ll_queue *llq)
 {
-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY)
+   if (arm_smmu_use_msipolling(smmu))
return __arm_smmu_cmdq_poll_until_msi(smmu, llq);
 
return __arm_smmu_cmdq_poll_until_consumed(smmu, llq);
-- 
2.27.0




[PATCH v5 2/2] arm64: mm: reserve per-numa CMA to localize coherent dma buffers

2020-07-31 Thread Barry Song
Right now, the SMMU uses dma_alloc_coherent() to get memory for its queues
and tables. Typically, on an ARM64 server, there is a single default CMA
located at node0, which could be far away from node2, node3, etc.
With this patch, the SMMU gets memory from the local NUMA node for its command
queues and page tables, which reduces dma_unmap latency considerably.
Meanwhile, when iommu.passthrough is on, device drivers which call
dma_alloc_coherent() also get local memory and avoid traffic between
NUMA nodes.

Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Ganapatrao Kulkarni 
Cc: Catalin Marinas 
Cc: Nicolas Saenz Julienne 
Cc: Steve Capper 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Signed-off-by: Barry Song 
---
 arch/arm64/mm/init.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index b6881d61b818..a6e19145ebb3 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -437,6 +437,8 @@ void __init bootmem_init(void)
arm64_hugetlb_cma_reserve();
 #endif
 
+   dma_pernuma_cma_reserve();
+
memblock_dump_all();
 }
 
-- 
2.27.0




[PATCH v5 1/2] dma-contiguous: provide the ability to reserve per-numa CMA

2020-07-31 Thread Barry Song
Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get
coherent DMA buffers to save their command queues and page tables. As
there is only one default CMA in the whole system, SMMUs on nodes other
than node0 will get remote memory. This leads to significant latency.

This patch provides per-NUMA CMA so that drivers like SMMU can get local
memory. Tests show that localizing CMA decreases dma_unmap latency
significantly. For instance, before this patch, the SMMU on node2 has to wait
more than 560ns for the completion of CMD_SYNC in an empty command queue;
with this patch, it needs only 240ns.

A positive side effect of this patch is a further performance improvement
for users who care more about performance than DMA security and use
iommu.passthrough=1 to skip the IOMMU. With local CMA, all drivers can get
local coherent DMA buffers.
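
The allocation policy this patch documents for pernuma_cma (try the area on
the device's own node first, then fall back to the global default area) can be
modelled in a few lines of user-space C. This is a sketch only: struct
ex_area, struct ex_cma_set and ex_alloc() are hypothetical names for
illustration, and a byte counter stands in for the real CMA allocator.

```c
#include <assert.h>
#include <stddef.h>

#define EX_MAX_NODES 4

/* Hypothetical model of one contiguous area: just a free-byte counter. */
struct ex_area {
	int node;
	size_t free_bytes;
};

/* One optional area per node plus the global default area. */
struct ex_cma_set {
	struct ex_area *pernuma[EX_MAX_NODES];
	struct ex_area *fallback;
};

/* Take size bytes from the area if it exists and has room. */
static int ex_take(struct ex_area *area, size_t size)
{
	if (area == NULL || area->free_bytes < size)
		return 0;
	area->free_bytes -= size;
	return 1;
}

/*
 * Return the area that served the request: the node-local area when
 * possible, otherwise the default area, or NULL if both fail.
 */
static struct ex_area *ex_alloc(struct ex_cma_set *set, int nid, size_t size)
{
	assert(set != NULL);

	if (nid >= 0 && nid < EX_MAX_NODES && ex_take(set->pernuma[nid], size))
		return set->pernuma[nid];

	if (ex_take(set->fallback, size))
		return set->fallback;

	return NULL;
}
```

With areas configured on node0 and node2 only, a request from node2 is served
locally until that area is exhausted, while a request from node1 goes straight
to the default area.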

Cc: Jonathan Cameron 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Ganapatrao Kulkarni 
Cc: Catalin Marinas 
Cc: Nicolas Saenz Julienne 
Cc: Steve Capper 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Signed-off-by: Barry Song 
---
 -v5:
 refine code according to Christoph Hellwig's comments
 * remove Kconfig option for pernuma cma size;
 * add Kconfig option for pernuma cma enable;
 * code cleanup like line over 80 char

 I haven't removed the cma NULL check code in cma_alloc() as it requires
 a bundle of other changes. So I prefer to handle this issue separately.

 .../admin-guide/kernel-parameters.txt |   9 ++
 include/linux/dma-contiguous.h|   6 ++
 kernel/dma/Kconfig|  10 ++
 kernel/dma/contiguous.c   | 100 --
 4 files changed, 115 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index fc81ece1b5aa..adad5e944600 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -599,6 +599,15 @@
altogether. For more information, see
include/linux/dma-contiguous.h
 
+   pernuma_cma=nn[MG]@[start[MG][-end[MG]]]
+   [ARM,X86,KNL]
+   Sets the size of kernel per-numa memory area for
+   contiguous memory allocations. A value of 0 disables
+   per-numa CMA altogether. DMA users on node nid will
+   first try to allocate buffer from the pernuma area
+   which is located in node nid, if the allocation fails,
+   they will fallback to the global default memory area.
+
cmo_free_hint=  [PPC] Format: { yes | no }
Specify whether pages are marked as being inactive
when they are freed.  This is used in CMO environments
diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
index 03f8e98e3bcc..fe55e004f1f4 100644
--- a/include/linux/dma-contiguous.h
+++ b/include/linux/dma-contiguous.h
@@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, 
struct page *page,
 
 #endif
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+void dma_pernuma_cma_reserve(void);
+#else
+static inline void dma_pernuma_cma_reserve(void) { }
+#endif
+
 #endif
 
 #endif
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 5d18456b5f01..fc20b8f3ef44 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -118,6 +118,16 @@ config DMA_CMA
  If unsure, say "n".
 
 if  DMA_CMA
+
+config DMA_PERNUMA_CMA
+   bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
+   help
+ Enable this option to get pernuma CMA areas so that devices like
+ ARM SMMU can get local memory by DMA coherent APIs.
+
+ You can disable pernuma CMA by specifying "pernuma_cma=0" on the
+ kernel's command line.
+
 comment "Default contiguous memory area size:"
 
 config CMA_SIZE_MBYTES
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index cff7e60968b9..67cdd2cb4949 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -69,6 +69,19 @@ static int __init early_cma(char *p)
 }
 early_param("cma", early_cma);
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+
+static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
+static phys_addr_t pernuma_size_bytes __initdata;
+
+static int __init early_pernuma_cma(char *p)
+{
pernuma_size_bytes = memparse(p, &p);
+   return 0;
+}
+early_param("pernuma_cma", early_pernuma_cma);
+#endif
+
 #ifdef CONFIG_CMA_SIZE_PERCENTAGE
 
 static phys_addr_t __init __maybe_unused cma_early_percent_memory(void)
@@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t 
cma_early_percent_memory(void)
 
 #endif
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+void __init dma_pernuma_cma_reserve(void)
+{
+   int ni
