RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-10-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song) [mailto:song.bao@hisilicon.com]
> Sent: Saturday, October 31, 2020 10:45 PM
> To: Robin Murphy ;
> iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com
> Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm
> ; linux-kselft...@vger.kernel.org
> Subject: RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
> 
> 
> 
> > -Original Message-
> > From: Robin Murphy [mailto:robin.mur...@arm.com]
> > Sent: Saturday, October 31, 2020 4:48 AM
> > To: Song Bao Hua (Barry Song) ;
> > iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com
> > Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm
> > ; linux-kselft...@vger.kernel.org
> > Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for
> streaming
> > DMA APIs
> >
> > On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote:
> > [...]
> > >>> +struct map_benchmark {
> > >>> +   __u64 map_nsec;
> > >>> +   __u64 unmap_nsec;
> > >>> +   __u32 threads; /* how many threads will do map/unmap in parallel
> > */
> > >>> +   __u32 seconds; /* how long the test will last */
> > >>> +   int node; /* which numa node this benchmark will run on */
> > >>> +   __u64 expansion[10];/* For future use */
> > >>> +};
> > >>
> > >> I'm no expert on userspace ABIs (and what little experience I do have
> > >> is mostly of Win32...), so hopefully someone else will comment if
> > >> there's anything of concern here. One thing I wonder is that there's
> > >> a fair likelihood of functionality evolving here over time, so might
> > >> it be appropriate to have some sort of explicit versioning parameter
> > >> for robustness?
> > >
> > > I copied that from gup_benchmark. There is no this kind of code to
> > > compare version.
> > > I believe there is a likelihood that kernel module is changed but
> > > users are still using old userspace tool, this might lead to the
> > > incompatible data structure.
> > > But not sure if it is a big problem :-)
> >
> > Yeah, like I say I don't really have a good feeling for what would be best 
> > here,
> > I'm just thinking of what I do know and wary of the potential for a "640 
> > bits
> > ought to be enough for anyone" issue ;)
> >
> > >>> +struct map_benchmark_data {
> > >>> +   struct map_benchmark bparam;
> > >>> +   struct device *dev;
> > >>> +   struct dentry  *debugfs;
> > >>> +   atomic64_t total_map_nsecs;
> > >>> +   atomic64_t total_map_loops;
> > >>> +   atomic64_t total_unmap_nsecs;
> > >>> +   atomic64_t total_unmap_loops;
> > >>> +};
> > >>> +
> > >>> +static int map_benchmark_thread(void *data) {
> > >>> +   struct page *page;
> > >>> +   dma_addr_t dma_addr;
> > >>> +   struct map_benchmark_data *map = data;
> > >>> +   int ret = 0;
> > >>> +
> > >>> +   page = alloc_page(GFP_KERNEL);
> > >>> +   if (!page)
> > >>> +   return -ENOMEM;
> > >>> +
> > >>> +   while (!kthread_should_stop())  {
> > >>> +   ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
> > >>> +
> > >>> +   map_stime = ktime_get();
> > >>> +   dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
> > >> DMA_BIDIRECTIONAL);
> > >>
> > >> Note that for a non-coherent device, this will give an underestimate
> > >> of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings,
> > >> since the page will never be dirty in the cache (except possibly the
> > >> very first time through).
> > >
> > > Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1"
> > > after we have this basic framework.
> >
> > That wasn't so much about the direction itself, just that if it's anything 
> > other
> > than FROM_DEVICE, we should probably do something to dirty the buffer by
> a
> > reasonable amount before each map. Otherwise the measured performance
> is
> > going to be unrealistic on many systems.
> 
> Maybe pu

RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-10-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Saturday, October 31, 2020 4:48 AM
> To: Song Bao Hua (Barry Song) ;
> iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com
> Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm
> ; linux-kselft...@vger.kernel.org
> Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
> 
> On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote:
> [...]
> >>> +struct map_benchmark {
> >>> + __u64 map_nsec;
> >>> + __u64 unmap_nsec;
> >>> + __u32 threads; /* how many threads will do map/unmap in parallel
> */
> >>> + __u32 seconds; /* how long the test will last */
> >>> + int node; /* which numa node this benchmark will run on */
> >>> + __u64 expansion[10];/* For future use */
> >>> +};
> >>
> >> I'm no expert on userspace ABIs (and what little experience I do have
> >> is mostly of Win32...), so hopefully someone else will comment if
> >> there's anything of concern here. One thing I wonder is that there's
> >> a fair likelihood of functionality evolving here over time, so might
> >> it be appropriate to have some sort of explicit versioning parameter
> >> for robustness?
> >
> > I copied that from gup_benchmark. There is no this kind of code to
> > compare version.
> > I believe there is a likelihood that kernel module is changed but
> > users are still using old userspace tool, this might lead to the
> > incompatible data structure.
> > But not sure if it is a big problem :-)
> 
> Yeah, like I say I don't really have a good feeling for what would be best 
> here,
> I'm just thinking of what I do know and wary of the potential for a "640 bits
> ought to be enough for anyone" issue ;)
> 
> >>> +struct map_benchmark_data {
> >>> + struct map_benchmark bparam;
> >>> + struct device *dev;
> >>> + struct dentry  *debugfs;
> >>> + atomic64_t total_map_nsecs;
> >>> + atomic64_t total_map_loops;
> >>> + atomic64_t total_unmap_nsecs;
> >>> + atomic64_t total_unmap_loops;
> >>> +};
> >>> +
> >>> +static int map_benchmark_thread(void *data) {
> >>> + struct page *page;
> >>> + dma_addr_t dma_addr;
> >>> + struct map_benchmark_data *map = data;
> >>> + int ret = 0;
> >>> +
> >>> + page = alloc_page(GFP_KERNEL);
> >>> + if (!page)
> >>> + return -ENOMEM;
> >>> +
> >>> + while (!kthread_should_stop())  {
> >>> + ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
> >>> +
> >>> + map_stime = ktime_get();
> >>> + dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
> >> DMA_BIDIRECTIONAL);
> >>
> >> Note that for a non-coherent device, this will give an underestimate
> >> of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings,
> >> since the page will never be dirty in the cache (except possibly the
> >> very first time through).
> >
> > Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1"
> > after we have this basic framework.
> 
> That wasn't so much about the direction itself, just that if it's anything 
> other
> than FROM_DEVICE, we should probably do something to dirty the buffer by a
> reasonable amount before each map. Otherwise the measured performance is
> going to be unrealistic on many systems.

Maybe put a memset(buf, 0, PAGE_SIZE) before dma_map will help ?

> 
> [...]
> >>> + atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime,
> >> unmap_stime)),
> >>> + >total_unmap_nsecs);
> >>> + atomic64_inc(>total_map_loops);
> >>> + atomic64_inc(>total_unmap_loops);
> >>
> >> I think it would be worth keeping track of the variances as well - it
> >> can be hard to tell if a reasonable-looking average is hiding
> >> terrible worst-case behaviour.
> >
> > This is a sensible requirement. I believe it is better to be handled
> > by the existing kernel tracing method.
> >
> > Maybe we need a histogram like:
> > Delay   sample count
> > 1-2us   1000  ***
> > 2-3us   2000  ***
> > 3-4us   100   *
> > .
> > This will be more precise than the maximum latency in the worst case.
> &g

Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-10-30 Thread Robin Murphy

On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote:
[...]

+struct map_benchmark {
+   __u64 map_nsec;
+   __u64 unmap_nsec;
+   __u32 threads; /* how many threads will do map/unmap in parallel */
+   __u32 seconds; /* how long the test will last */
+   int node; /* which numa node this benchmark will run on */
+   __u64 expansion[10];/* For future use */
+};


I'm no expert on userspace ABIs (and what little experience I do have is
mostly of Win32...), so hopefully someone else will comment if there's
anything of concern here. One thing I wonder is that there's a fair
likelihood of functionality evolving here over time, so might it be
appropriate to have some sort of explicit versioning parameter for
robustness?


I copied that from gup_benchmark. There is no this kind of code to
compare version.
I believe there is a likelihood that kernel module is changed but users
are still using old userspace tool, this might lead to the incompatible
data structure.
But not sure if it is a big problem :-)


Yeah, like I say I don't really have a good feeling for what would be 
best here, I'm just thinking of what I do know and wary of the potential 
for a "640 bits ought to be enough for anyone" issue ;)



+struct map_benchmark_data {
+   struct map_benchmark bparam;
+   struct device *dev;
+   struct dentry  *debugfs;
+   atomic64_t total_map_nsecs;
+   atomic64_t total_map_loops;
+   atomic64_t total_unmap_nsecs;
+   atomic64_t total_unmap_loops;
+};
+
+static int map_benchmark_thread(void *data)
+{
+   struct page *page;
+   dma_addr_t dma_addr;
+   struct map_benchmark_data *map = data;
+   int ret = 0;
+
+   page = alloc_page(GFP_KERNEL);
+   if (!page)
+   return -ENOMEM;
+
+   while (!kthread_should_stop())  {
+   ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
+
+   map_stime = ktime_get();
+   dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,

DMA_BIDIRECTIONAL);

Note that for a non-coherent device, this will give an underestimate of
the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings, since
the page will never be dirty in the cache (except possibly the very
first time through).


Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1" after we have
this basic framework.


That wasn't so much about the direction itself, just that if it's 
anything other than FROM_DEVICE, we should probably do something to 
dirty the buffer by a reasonable amount before each map. Otherwise the 
measured performance is going to be unrealistic on many systems.


[...]

+   atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime,

unmap_stime)),

+   >total_unmap_nsecs);
+   atomic64_inc(>total_map_loops);
+   atomic64_inc(>total_unmap_loops);


I think it would be worth keeping track of the variances as well - it
can be hard to tell if a reasonable-looking average is hiding terrible
worst-case behaviour.


This is a sensible requirement. I believe it is better to be handled by the
existing kernel tracing method.

Maybe we need a histogram like:
Delay   sample count
1-2us   1000  ***
2-3us   2000  ***
3-4us   100   *
.
This will be more precise than the maximum latency in the worst
case.

I'd believe this can be handled by:
tracepoint  A
Map
Tracepoint  B

Tracepoint   C
Unmap
Tracepoint   D

Let the userspace ebpf to draw the histogram for the delta of B-A and D-C.

So I am planning to put this requirement into todo list and write an userspace
ebpf/bcc script for histogram and put in tools/ directory.

Please give your comments on this.


Right, I wasn't suggesting trying to homebrew a full data collection 
system here - I agree there are better tools for that already - just 
that it's basically free to track a sum of squares alongside a sum, so 
that we can trivially calculate a useful variance (or standard 
deviation) figure alongside the mean at the end.


[...]

+   for (i = 0; i < threads; i++) {
+   tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
+   map->bparam.node, "dma-map-benchmark/%d", i);
+   if (IS_ERR(tsk[i])) {
+   dev_err(map->dev, "create dma_map thread failed\n");
+   return PTR_ERR(tsk[i]);
+   }
+
+   if (node != NUMA_NO_NODE && node_online(node))
+   kthread_bind_mask(tsk[i], cpu_mask);
+
+   wake_up_process(tsk[i]);


Might it be better to create all the threads first, *then* start kicking
them?


The difficulty is that we don't know how many threads we should create as
the thread number is a parameter to test the contention of IOMMU driver.
In my test case, I'd like to test things like
One thread
Two threads

8 threads
12 threads
16 threads...

On the other hand, I 

RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-10-29 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Friday, October 30, 2020 8:38 AM
> To: Song Bao Hua (Barry Song) ;
> iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com
> Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm
> ; linux-kselft...@vger.kernel.org
> Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
> 
> On 2020-10-27 03:53, Barry Song wrote:
> > Nowadays, there are increasing requirements to benchmark the performance
> > of dma_map and dma_unmap particually while the device is attached to an
> > IOMMU.
> >
> > This patch enables the support. Users can run specified number of threads
> > to do dma_map_page and dma_unmap_page on a specific NUMA node with
> the
> > specified duration. Then dma_map_benchmark will calculate the average
> > latency for map and unmap.
> >
> > A difficulity for this benchmark is that dma_map/unmap APIs must run on
> > a particular device. Each device might have different backend of IOMMU or
> > non-IOMMU.
> >
> > So we use the driver_override to bind dma_map_benchmark to a particual
> > device by:
> > echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
> > echo xxx > /sys/bus/platform/drivers/xxx/unbind
> > echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> >
> > For this moment, it supports platform device only, PCI device will also
> > be supported afterwards.
> 
> Neat! This is something I've thought about many times, but never got
> round to attempting :)

I am happy you have the same needs. When I came to IOMMU area a half year ago,
the first thing I've done was writing a rough benchmark. At that time, I hacked 
kernel
to get a device behind an IOMMU.

Recently, I got some time to think about how to get "device" without ugly 
hacking and
then clean up code for sending patches out to provide a common benchmark in 
order
that everybody can use.

> 
> I think the basic latency measurement for mapping and unmapping pages is
> enough to start with, but there are definitely some more things that
> would be interesting to look into for future enhancements:
> 
>   - a choice of mapping sizes, both smaller and larger than one page, to
> help characterise stuff like cache maintenance overhead and bounce
> buffer/IOVA fragmentation.
>   - alternative allocation patterns like doing lots of maps first, then
> all their corresponding unmaps (to provoke things like the worst-case
> IOVA rcache behaviour).
>   - ways to exercise a range of those parameters at once across
> different threads in a single test.
> 

Yes, sure. Once we have a basic framework, we can add more benchmark patterns
by using different parameters in the userspace tool:
testing/selftests/dma/dma_map_benchmark.c

Similar function extensions have been carried out in GUP_BENCHMARK.

> But let's get a basic framework nailed down first...

Sure.

> 
> > Cc: Joerg Roedel 
> > Cc: Will Deacon 
> > Cc: Shuah Khan 
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Robin Murphy 
> > Signed-off-by: Barry Song 
> > ---
> >   kernel/dma/Kconfig |   8 ++
> >   kernel/dma/Makefile|   1 +
> >   kernel/dma/map_benchmark.c | 202
> +
> >   3 files changed, 211 insertions(+)
> >   create mode 100644 kernel/dma/map_benchmark.c
> >
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index c99de4a21458..949c53da5991 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
> >   is technically out-of-spec.
> >
> >   If unsure, say N.
> > +
> > +config DMA_MAP_BENCHMARK
> > +   bool "Enable benchmarking of streaming DMA mapping"
> > +   help
> > + Provides /sys/kernel/debug/dma_map_benchmark that helps with
> testing
> > + performance of dma_(un)map_page.
> > +
> > + See tools/testing/selftests/dma/dma_map_benchmark.c
> > diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
> > index dc755ab68aab..7aa6b26b1348 100644
> > --- a/kernel/dma/Makefile
> > +++ b/kernel/dma/Makefile
> > @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG)   += debug.o
> >   obj-$(CONFIG_SWIOTLB) += swiotlb.o
> >   obj-$(CONFIG_DMA_COHERENT_POOL)   += pool.o
> >   obj-$(CONFIG_DMA_REMAP)   += remap.o
> > +obj-$(CONFIG_DMA_MAP_BENCHMARK)+= map_benchmark.o
> > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchma

Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-10-29 Thread Robin Murphy

On 2020-10-27 03:53, Barry Song wrote:

Nowadays, there are increasing requirements to benchmark the performance
of dma_map and dma_unmap particually while the device is attached to an
IOMMU.

This patch enables the support. Users can run specified number of threads
to do dma_map_page and dma_unmap_page on a specific NUMA node with the
specified duration. Then dma_map_benchmark will calculate the average
latency for map and unmap.

A difficulity for this benchmark is that dma_map/unmap APIs must run on
a particular device. Each device might have different backend of IOMMU or
non-IOMMU.

So we use the driver_override to bind dma_map_benchmark to a particual
device by:
echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind

For this moment, it supports platform device only, PCI device will also
be supported afterwards.


Neat! This is something I've thought about many times, but never got 
round to attempting :)


I think the basic latency measurement for mapping and unmapping pages is 
enough to start with, but there are definitely some more things that 
would be interesting to look into for future enhancements:


 - a choice of mapping sizes, both smaller and larger than one page, to 
help characterise stuff like cache maintenance overhead and bounce 
buffer/IOVA fragmentation.
 - alternative allocation patterns like doing lots of maps first, then 
all their corresponding unmaps (to provoke things like the worst-case 
IOVA rcache behaviour).
 - ways to exercise a range of those parameters at once across 
different threads in a single test.


But let's get a basic framework nailed down first...


Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Shuah Khan 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Signed-off-by: Barry Song 
---
  kernel/dma/Kconfig |   8 ++
  kernel/dma/Makefile|   1 +
  kernel/dma/map_benchmark.c | 202 +
  3 files changed, 211 insertions(+)
  create mode 100644 kernel/dma/map_benchmark.c

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index c99de4a21458..949c53da5991 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
  is technically out-of-spec.
  
  	  If unsure, say N.

+
+config DMA_MAP_BENCHMARK
+   bool "Enable benchmarking of streaming DMA mapping"
+   help
+ Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
+ performance of dma_(un)map_page.
+
+ See tools/testing/selftests/dma/dma_map_benchmark.c
diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
index dc755ab68aab..7aa6b26b1348 100644
--- a/kernel/dma/Makefile
+++ b/kernel/dma/Makefile
@@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG)   += debug.o
  obj-$(CONFIG_SWIOTLB) += swiotlb.o
  obj-$(CONFIG_DMA_COHERENT_POOL)   += pool.o
  obj-$(CONFIG_DMA_REMAP)   += remap.o
+obj-$(CONFIG_DMA_MAP_BENCHMARK)+= map_benchmark.o
diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
new file mode 100644
index ..16a5d7779d67
--- /dev/null
+++ b/kernel/dma/map_benchmark.c
@@ -0,0 +1,202 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Hisilicon Limited.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 


Nit: alphabetical order is always nice, when there's not an existing 
precedent of a complete mess...



+
+#define DMA_MAP_BENCHMARK  _IOWR('d', 1, struct map_benchmark)
+
+struct map_benchmark {
+   __u64 map_nsec;
+   __u64 unmap_nsec;
+   __u32 threads; /* how many threads will do map/unmap in parallel */
+   __u32 seconds; /* how long the test will last */
+   int node; /* which numa node this benchmark will run on */
+   __u64 expansion[10];/* For future use */
+};


I'm no expert on userspace ABIs (and what little experience I do have is 
mostly of Win32...), so hopefully someone else will comment if there's 
anything of concern here. One thing I wonder is that there's a fair 
likelihood of functionality evolving here over time, so might it be 
appropriate to have some sort of explicit versioning parameter for 
robustness?



+struct map_benchmark_data {
+   struct map_benchmark bparam;
+   struct device *dev;
+   struct dentry  *debugfs;
+   atomic64_t total_map_nsecs;
+   atomic64_t total_map_loops;
+   atomic64_t total_unmap_nsecs;
+   atomic64_t total_unmap_loops;
+};
+
+static int map_benchmark_thread(void *data)
+{
+   struct page *page;
+   dma_addr_t dma_addr;
+   struct map_benchmark_data *map = data;
+   int ret = 0;
+
+   page = alloc_page(GFP_KERNEL);
+   if (!page)
+   return -ENOMEM;
+
+   while (!kthread_should_stop())  {
+