Re: [PATCH v8 0/9] rwsem performance optimizations

2013-11-04 Thread Tim Chen
Ingo,

Sorry for the late response.  My old 4-socket Westmere
test machine went down and I had to find a new one,
a 4-socket Ivy Bridge machine with 15 cores per socket.

I've updated the workload as a perf benchmark (see attached
patch).  The workload mmaps a region, accesses memory in the
mmapped area and then unmaps it, doing so repeatedly for a
specified time.  Each thread is pinned to a particular core,
with the threads distributed evenly between the sockets.  The
throughput is reported along with its standard deviation.
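
In essence, each worker thread does something like the sketch below
(a simplified illustration, not the actual mem-mmap.c code in the
attached patch; the pinning and loop details are only indicative):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/mman.h>
#include <assert.h>
#include <stddef.h>

/* Illustrative benchmark worker: pin to a given cpu, then
 * mmap / write / munmap a region until asked to stop.
 */
static void worker(int cpu, size_t len, volatile int *stop,
                   unsigned long long *iterations)
{
        cpu_set_t mask;
        size_t i;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);                    /* pin this thread to its core */
        sched_setaffinity(0, sizeof(mask), &mask);

        while (!*stop) {
                char *c = mmap(NULL, len, PROT_READ|PROT_WRITE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                assert(c != MAP_FAILED);
                for (i = 0; i < len; i += 8)    /* write to the mapped pages */
                        c[i] = 0xa;
                munmap(c, len);
                (*iterations)++;
        }
}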

First, a baseline comparing the workload with serialized mmap vs.
non-serialized mmap, both running under the vanilla kernel.

Threads   Throughput serialized vs        std dev (%)
          non-serialized mmap (%)
1         0.10                            0.16
2         0.78                            0.09
3         -5.00                           0.12
4         -3.27                           0.08
5         -0.11                           0.09
10        5.32                            0.10
20        -2.05                           0.05
40        -9.75                           0.15
60        11.69                           0.05


Here's the data for the complete rwsem patchset vs. the plain vanilla
kernel.  Overall there's improvement except for the 3-thread case.

Threads   Throughput vs vanilla (%)       std dev (%)
1         0.62                            0.11
2         3.86                            0.10
3         -7.02                           0.19
4         -0.01                           0.13
5         2.74                            0.06
10        5.66                            0.03
20        1.44                            0.09
40        5.54                            0.09
60        15.63                           0.13

Now testing with both the patched kernel and the vanilla kernel
running serialized mmap, with mutex acquisition in user space.

Threads   Throughput vs vanilla (%)       std dev (%)
1         0.60                            0.02
2         6.40                            0.11
3         14.13                           0.07
4         -2.41                           0.07
5         1.05                            0.08
10        4.15                            0.05
20        -0.26                           0.06
40        -3.45                           0.13
60        -4.33                           0.07

Here's another run with the rwsem patchset but without
optimistic spinning:

Threads   Throughput vs vanilla (%)       std dev (%)
1         0.81                            0.04
2         2.85                            0.17
3         -4.09                           0.05
4         -8.31                           0.07
5         -3.19                           0.03
10        1.02                            0.05
20        -4.77                           0.04
40        -3.11                           0.10
60        2.06                            0.10

No-optspin case, comparing the serialized mmap workload under the
patched kernel vs. the vanilla kernel:

Threads   Throughput vs vanilla (%)       std dev (%)
1         0.57                            0.03
2         2.13                            0.17
3         14.78                           0.33
4         -1.23                           0.11
5         2.99                            0.08
10        -0.43                           0.10
20        0.01                            0.03
40        3.03                            0.10
60        -1.74                           0.09


The data is a bit of a mixed bag.  I'll spin off
the MCS cleanup patch separately so we can merge that first
for Waiman's qrwlock work.

Tim

---
>From 6c5916315c1515fb2281d9344b2c4f371ca99879 Mon Sep 17 00:00:00 2001
From: Tim Chen 
Date: Wed, 30 Oct 2013 05:18:29 -0700
Subject: [PATCH] perf mmap and memory write test

This patch adds a perf benchmark that mmaps a piece of memory,
writes to the memory and unmaps the memory, for a given
number of threads.  The threads are distributed and pinned
evenly across the sockets on the machine.  The options
for the benchmark are as follows:

 usage: perf bench mem mmap <options>

    -l, --length <1MB>    Specify length of memory to set. Available units:
                          B, KB, MB, GB and TB (upper and lower)
    -i, --iterations <n>  repeat mmap() invocation this number of times
    -n, --threads <n>     number of threads doing mmap() invocation
    -r, --runtime <n>     runtime per iteration in sec
    -w, --warmup <n>      warmup time in sec
    -s, --serialize       serialize the mmap() operations with mutex
    -v, --verbose         verbose output giving info about each iteration
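
An example invocation (illustrative only; the option values are just
examples):

 perf bench mem mmap -l 1MB -n 60 -r 5 -w 1 -s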

Signed-off-by: Tim Chen 
---
 tools/perf/Makefile |   1 +
 tools/perf/bench/bench.h|   1 +
 tools/perf/bench/mem-mmap.c | 312 
 tools/perf/builtin-bench.c  |   3 +
 4 files changed, 317 insertions(+)
 create mode 100644 tools/perf/bench/mem-mmap.c

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 64c043b..80e32d1 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -408,6 +408,7 @@ BUILTIN_OBJS += $(OUTPUT)bench/mem-memset-x86-64-asm.o
 endif
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memset.o
+BUILTIN_OBJS += $(OUTPUT)bench/mem-mmap.o
 
 BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
 BUILTIN_OBJS += $(OUTPUT)builtin-evlist.o
diff --git 

Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-18 Thread Ingo Molnar

* Tim Chen  wrote:

> 
> > 
> > It would be _really_ nice to stick this into tools/perf/bench/ as:
> > 
> > perf bench mem pagefaults
> > 
> > or so, with a number of parallelism and workload patterns. See 
> > tools/perf/bench/numa.c for a couple of workload generators - although 
> > those are not page fault intense.
> > 
> > So that future generations can run all these tests too and such.
> > 
> > > I compare the throughput where I have the complete rwsem patchset 
> > > against vanilla and the case where I take out the optimistic spin patch.  
> > > I have increased the run time by 10x from my previous experiments and do 
> > > 10 runs for each case.  The standard deviation is ~1.5%, so any change 
> > > under 1.5% is not statistically significant.
> > > 
> > > % change in throughput vs the vanilla kernel.
> > > Threads   all       No-optspin
> > > 1         +0.4%     -0.1%
> > > 2         +2.0%     +0.2%
> > > 3         +1.1%     +1.5%
> > > 4         -0.5%     -1.4%
> > > 5         -0.1%     -0.1%
> > > 10        +2.2%     -1.2%
> > > 20        +237.3%   -2.3%
> > > 40        +548.1%   +0.3%
> > 
> > The tail is impressive. The early parts are important as well, but it's 
> > really hard to tell the significance of the early portion without having 
> > an stddev column.
> > 
> > ( "perf stat --repeat N" will give you stddev output, in handy percentage 
> >   form. )
> 
> Quick naive question as I haven't hacked perf bench before.  

Btw., please use tip:master, I've got a few cleanups in there that should 
make it easier to hack.

> Now perf stat gives the statistics of the performance counter or events.
> How do I get it to compute the stats of 
> the throughput reported by perf bench?

What I do is that I measure the execution time, via:

  perf stat --null --repeat 10 perf bench ...

instead of relying on benchmark output.
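
For example (illustrative):

  perf stat --null --repeat 10 -- perf bench sched pipe

which prints the mean elapsed time together with its stddev as a
percentage.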

> Something like
> 
> perf stat -r 10 -- perf bench mem memset --iterations 10
> 
> doesn't quite give what I need.

Yeah. So, perf bench also has a 'simple' output format:

  comet:~/tip> perf bench -f simple sched pipe
  10.378

We could extend 'perf stat' with an option to not measure time, but to 
take any numeric data output from the executed task and use that as the 
measurement result.

If you'd be interested in such a feature I can give it a try.
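
In the meantime, one stop-gap is to post-process repeated 'simple' runs
with a trivial reducer, something like the sketch below (illustrative
helper only, not part of perf):

#include <stdio.h>
#include <math.h>

/* Read one number per line from stdin, print mean and sample stddev.
 * Example:
 *   for i in $(seq 10); do perf bench -f simple sched pipe; done | ./stddev
 */
int main(void)
{
        double x, sum = 0.0, sumsq = 0.0;
        long n = 0;

        while (scanf("%lf", &x) == 1) {
                sum += x;
                sumsq += x * x;
                n++;
        }
        if (n < 2)
                return 1;

        double mean = sum / n;
        double var = (sumsq - n * mean * mean) / (n - 1);   /* sample variance */
        double sd = sqrt(var > 0 ? var : 0);

        printf("n=%ld mean=%g stddev=%g (%.2f%%)\n", n, mean, sd, 100.0 * sd / mean);
        return 0;
}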

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-16 Thread Tim Chen

> 
> It would be _really_ nice to stick this into tools/perf/bench/ as:
> 
>   perf bench mem pagefaults
> 
> or so, with a number of parallelism and workload patterns. See 
> tools/perf/bench/numa.c for a couple of workload generators - although 
> those are not page fault intense.
> 
> So that future generations can run all these tests too and such.
> 
> > I compare the throughput where I have the complete rwsem patchset 
> > against vanilla and the case where I take out the optimistic spin patch.  
> > I have increased the run time by 10x from my previous experiments and do 
> > 10 runs for each case.  The standard deviation is ~1.5%, so any change 
> > under 1.5% is not statistically significant.
> > 
> > % change in throughput vs the vanilla kernel.
> > Threads all No-optspin
> > 1   +0.4%   -0.1%
> > 2   +2.0%   +0.2%
> > 3   +1.1%   +1.5%
> > 4   -0.5%   -1.4%
> > 5   -0.1%   -0.1%
> > 10  +2.2%   -1.2%
> > 20  +237.3% -2.3%
> > 40  +548.1% +0.3%
> 
> The tail is impressive. The early parts are important as well, but it's 
> really hard to tell the significance of the early portion without having 
> an stddev column.
> 
> ( "perf stat --repeat N" will give you stddev output, in handy percentage 
>   form. )

Quick naive question as I haven't hacked perf bench before.  
Now perf stat gives the statistics of the performance counter or events.
How do I get it to compute the stats of 
the throughput reported by perf bench?

Something like

perf stat -r 10 -- perf bench mem memset --iterations 10

doesn't quite give what I need.

Pointers appreciated.

Tim

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-16 Thread Tim Chen
On Wed, 2013-10-16 at 08:55 +0200, Ingo Molnar wrote:
> * Tim Chen  wrote:
> 
> > On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> > > * Tim Chen  wrote:
> > > 
> > > > The throughput of pure mmap with mutex vs pure mmap is below:
> > > > 
> > > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > > #threads    vanilla    all rwsem patches    without optspin
> > > > 1           3.0%       -1.0%                -1.7%
> > > > 5           7.2%       -26.8%               5.5%
> > > > 10          5.2%       -10.6%               22.1%
> > > > 20          6.8%       16.4%                12.5%
> > > > 40          -0.2%      32.7%                0.0%
> > > > 
> > > > So with mutex, the vanilla kernel and the one without optspin both run 
> > > > faster.  This is consistent with what Peter reported.  With optspin, 
> > > > the 
> > > > picture is more mixed, with lower throughput at low to moderate number 
> > > > of threads and higher throughput with high number of threads.
> > > 
> > > So, going back to your original table:
> > > 
> > > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > > #threads    vanilla    all       without optspin
> > > > 1           3.0%       -1.0%     -1.7%
> > > > 5           7.2%       -26.8%    5.5%
> > > > 10          5.2%       -10.6%    22.1%
> > > > 20          6.8%       16.4%     12.5%
> > > > 40          -0.2%      32.7%     0.0%
> > > >
> > > > In general, vanilla and no-optspin case perform better with 
> > > > pthread-mutex.  For the case with optspin, mmap with pthread-mutex is 
> > > > worse at low to moderate contention and better at high contention.
> > > 
> > > it appears that 'without optspin' appears to be a pretty good choice - if 
> > > it wasn't for that '1 thread' number, which, if I correctly assume is the 
> > > uncontended case, is one of the most common usecases ...
> > > 
> > > How can the single-threaded case get slower? None of the patches should 
> > > really cause noticeable overhead in the non-contended case. That looks 
> > > weird.
> > > 
> > > It would also be nice to see the 2, 3, 4 thread numbers - those are the 
> > > most common contention scenarios in practice - where do we see the first 
> > > improvement in performance?
> > > 
> > > Also, it would be nice to include a noise/stddev figure, it's really hard 
> > > to tell whether -1.7% is statistically significant.
> > 
> > Ingo,
> > 
> > I think that the optimistic spin changes to rwsem should enhance 
> > performance to real workloads after all.
> > 
> > In my previous tests, I was doing mmap followed immediately by 
> > munmap without doing anything to the memory.  No real workload
> > will behave that way and it is not the scenario that we 
> > should optimize for.  A much better approximation of
> > real usages will be doing mmap, then touching 
> > the memories being mmaped, followed by munmap.  
> 
> That's why I asked for a working testcase to be posted ;-) Not just 
> pseudocode - send the real .c thing please.

I was using a modified version of Anton's will-it-scale test.  I'll try
to port the tests to perf bench to make it easier for other people to
run the tests.

> 
> > This changes the dynamics of the rwsem as we are now dominated by read 
> > acquisitions of mmap sem due to the page faults, instead of having only 
> > write acquisitions from mmap. [...]
> 
> Absolutely, the page fault read case is the #1 optimization target of 
> rwsems.
> 
> > [...] In this case, any delay in write acquisitions will be costly as we 
> > will be blocking a lot of readers.  This is where optimistic spinning on 
> > write acquisitions of mmap sem can provide a very significant boost to 
> > the throughput.
> > 
> > I change the test case to the following with writes to
> > the mmaped memory:
> > 
> > #define MEMSIZE (1 * 1024 * 1024)
> > 
> > char *testcase_description = "Anonymous memory mmap/munmap of 1MB";
> > 
> > void testcase(unsigned long long *iterations)
> > {
> > int i;
> > 
> > while (1) {
> > char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
> >MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> > assert(c != MAP_FAILED);
> > for (i=0; i<MEMSIZE; i+=8) {
> > c[i] = 0xa;
> > }
> > munmap(c, MEMSIZE);
> > 
> > (*iterations)++;
> > }
> > }
> 
> It would be _really_ nice to stick this into tools/perf/bench/ as:
> 
>   perf bench mem pagefaults
> 
> or so, with a number of parallelism and workload patterns. See 
> tools/perf/bench/numa.c for a couple of workload generators - although 
> those are not page fault intense.
> 
> So that future generations can run all these tests too and such.

Okay, will do.

> 
> > I compare the throughput where I have the complete rwsem patchset 
> > against vanilla and the case where I take out the optimistic spin patch.  
> > I have increased the run time by 10x from my previous experiments and do 
> > 10 runs for each case.  The standard deviation is ~1.5%, so any change 
> > under 1.5% is not statistically significant.

Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-16 Thread Ingo Molnar

* Tim Chen  wrote:

> On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> > * Tim Chen  wrote:
> > 
> > > The throughput of pure mmap with mutex vs pure mmap is below:
> > > 
> > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > #threads    vanilla    all rwsem patches    without optspin
> > > 1           3.0%       -1.0%                -1.7%
> > > 5           7.2%       -26.8%               5.5%
> > > 10          5.2%       -10.6%               22.1%
> > > 20          6.8%       16.4%                12.5%
> > > 40          -0.2%      32.7%                0.0%
> > > 
> > > So with mutex, the vanilla kernel and the one without optspin both run 
> > > faster.  This is consistent with what Peter reported.  With optspin, the 
> > > picture is more mixed, with lower throughput at low to moderate number 
> > > of threads and higher throughput with high number of threads.
> > 
> > So, going back to your original table:
> > 
> > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > #threads    vanilla    all       without optspin
> > 1           3.0%       -1.0%     -1.7%
> > 5           7.2%       -26.8%    5.5%
> > 10          5.2%       -10.6%    22.1%
> > 20          6.8%       16.4%     12.5%
> > 40          -0.2%      32.7%     0.0%
> > >
> > > In general, vanilla and no-optspin case perform better with 
> > > pthread-mutex.  For the case with optspin, mmap with pthread-mutex is 
> > > worse at low to moderate contention and better at high contention.
> > 
> > it appears that 'without optspin' appears to be a pretty good choice - if 
> > it wasn't for that '1 thread' number, which, if I correctly assume is the 
> > uncontended case, is one of the most common usecases ...
> > 
> > How can the single-threaded case get slower? None of the patches should 
> > really cause noticeable overhead in the non-contended case. That looks 
> > weird.
> > 
> > It would also be nice to see the 2, 3, 4 thread numbers - those are the 
> > most common contention scenarios in practice - where do we see the first 
> > improvement in performance?
> > 
> Also, it would be nice to include a noise/stddev figure, it's really hard 
> > to tell whether -1.7% is statistically significant.
> 
> Ingo,
> 
> I think that the optimistic spin changes to rwsem should enhance 
> performance to real workloads after all.
> 
> In my previous tests, I was doing mmap followed immediately by 
> munmap without doing anything to the memory.  No real workload
> will behave that way and it is not the scenario that we 
> should optimize for.  A much better approximation of
> real usages will be doing mmap, then touching 
> the memories being mmaped, followed by munmap.  

That's why I asked for a working testcase to be posted ;-) Not just 
pseudocode - send the real .c thing please.

> This changes the dynamics of the rwsem as we are now dominated by read 
> acquisitions of mmap sem due to the page faults, instead of having only 
> write acquisitions from mmap. [...]

Absolutely, the page fault read case is the #1 optimization target of 
rwsems.

> [...] In this case, any delay in write acquisitions will be costly as we 
> will be blocking a lot of readers.  This is where optimistic spinning on 
> write acquisitions of mmap sem can provide a very significant boost to 
> the throughput.
> 
> I change the test case to the following with writes to
> the mmaped memory:
> 
> #define MEMSIZE (1 * 1024 * 1024)
> 
> char *testcase_description = "Anonymous memory mmap/munmap of 1MB";
> 
> void testcase(unsigned long long *iterations)
> {
> int i;
> 
> while (1) {
> char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
>MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> assert(c != MAP_FAILED);
> for (i=0; i<MEMSIZE; i+=8) {
> c[i] = 0xa;
> }
> munmap(c, MEMSIZE);
> 
> (*iterations)++;
> }
> }

It would be _really_ nice to stick this into tools/perf/bench/ as:

perf bench mem pagefaults

or so, with a number of parallelism and workload patterns. See 
tools/perf/bench/numa.c for a couple of workload generators - although 
those are not page fault intense.

So that future generations can run all these tests too and such.

> I compare the throughput where I have the complete rwsem patchset 
> against vanilla and the case where I take out the optimistic spin patch.  
> I have increased the run time by 10x from my previous experiments and do 
> 10 runs for each case.  The standard deviation is ~1.5%, so any change 
> under 1.5% is not statistically significant.
> 
> % change in throughput vs the vanilla kernel.
> Threads   all       No-optspin
> 1         +0.4%     -0.1%
> 2         +2.0%     +0.2%
> 3         +1.1%     +1.5%
> 4         -0.5%     -1.4%
> 5         -0.1%     -0.1%
> 10        +2.2%     -1.2%
> 20        +237.3%   -2.3%
> 40        +548.1%   +0.3%

The tail is impressive. The early parts are important as well, but it's 
really hard to tell the significance of the early portion without having 
an stddev column.

( "perf stat --repeat N" will give you stddev output, in handy percentage 
  form. )


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-15 Thread Tim Chen
On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> * Tim Chen  wrote:
> 
> > The throughput of pure mmap with mutex vs pure mmap is below:
> > 
> > % change in performance of the mmap with pthread-mutex vs pure mmap
> > #threads    vanilla    all rwsem patches    without optspin
> > 1           3.0%       -1.0%                -1.7%
> > 5           7.2%       -26.8%               5.5%
> > 10          5.2%       -10.6%               22.1%
> > 20          6.8%       16.4%                12.5%
> > 40          -0.2%      32.7%                0.0%
> > 
> > So with mutex, the vanilla kernel and the one without optspin both run 
> > faster.  This is consistent with what Peter reported.  With optspin, the 
> > picture is more mixed, with lower throughput at low to moderate number 
> > of threads and higher throughput with high number of threads.
> 
> So, going back to your original table:
> 
> > % change in performance of the mmap with pthread-mutex vs pure mmap
> > #threads    vanilla    all       without optspin
> > 1           3.0%       -1.0%     -1.7%
> > 5           7.2%       -26.8%    5.5%
> > 10          5.2%       -10.6%    22.1%
> > 20          6.8%       16.4%     12.5%
> > 40          -0.2%      32.7%     0.0%
> >
> > In general, vanilla and no-optspin case perform better with 
> > pthread-mutex.  For the case with optspin, mmap with pthread-mutex is 
> > worse at low to moderate contention and better at high contention.
> 
> it appears that 'without optspin' appears to be a pretty good choice - if 
> it wasn't for that '1 thread' number, which, if I correctly assume is the 
> uncontended case, is one of the most common usecases ...
> 
> How can the single-threaded case get slower? None of the patches should 
> really cause noticeable overhead in the non-contended case. That looks 
> weird.
> 
> It would also be nice to see the 2, 3, 4 thread numbers - those are the 
> most common contention scenarios in practice - where do we see the first 
> improvement in performance?
> 
> Also, it would be nice to include a noise/stddev figure, it's really hard 
> to tell whether -1.7% is statistically significant.

Ingo,

I think that the optimistic spin changes to rwsem should enhance
performance to real workloads after all.

In my previous tests, I was doing mmap followed immediately by 
munmap without doing anything to the memory.  No real workload
will behave that way and it is not the scenario that we 
should optimize for.  A much better approximation of
real usages will be doing mmap, then touching 
the memories being mmaped, followed by munmap.  

This changes the dynamics of the rwsem as we are now dominated
by read acquisitions of mmap sem due to the page faults, instead
of having only write acquisitions from mmap. In this case, any delay 
in write acquisitions will be costly as we will be
blocking a lot of readers.  This is where optimistic spinning on
write acquisitions of mmap sem can provide a very significant boost
to the throughput.

I change the test case to the following with writes to
the mmaped memory:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB";

void testcase(unsigned long long *iterations)
{
int i;

while (1) {
char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
assert(c != MAP_FAILED);
for (i=0; i<MEMSIZE; i+=8) {
c[i] = 0xa;
}
munmap(c, MEMSIZE);

(*iterations)++;
}
}

I compare the throughput where I have the complete rwsem 
patchset against vanilla and the case where I take out the 
optimistic spin patch.  I have increased the run time
by 10x from my previous experiments and do 10 runs for
each case.  The standard deviation is ~1.5%, so any change
under 1.5% is not statistically significant.

% change in throughput vs the vanilla kernel.
Threads   all       No-optspin
1         +0.4%     -0.1%
2         +2.0%     +0.2%
3         +1.1%     +1.5%
4         -0.5%     -1.4%
5         -0.1%     -0.1%
10        +2.2%     -1.2%
20        +237.3%   -2.3%
40        +548.1%   +0.3%

For threads 1 to 5, we essentially
have about the same performance as the vanilla case.
We are getting a boost in throughput by 237% for 20 threads
and 548% for 40 threads.  Now when we take out
the optimistic spin, we have mostly similar throughput as
the vanilla kernel for this test.

When I look at the profile of the vanilla
kernel for the 40 threads case, I saw 80% of
cpu time is spent contending for the spin lock of the rwsem
wait queue, in rwsem_down_read_failed during page faults.
When I apply the rwsem patchset with optimistic spin,
this lock contention went down to only 2% of cpu time.

Now when I test the case where we acquire mutex in the
user space before mmap, I got the following data versus
the vanilla kernel.  There's little contention on mmap sem
acquisition in this case.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-10 Thread Ingo Molnar

* Tim Chen  wrote:

> The throughput of pure mmap with mutex vs pure mmap is below:
> 
> % change in performance of the mmap with pthread-mutex vs pure mmap
> #threads    vanilla    all rwsem patches    without optspin
> 1           3.0%       -1.0%                -1.7%
> 5           7.2%       -26.8%               5.5%
> 10          5.2%       -10.6%               22.1%
> 20          6.8%       16.4%                12.5%
> 40          -0.2%      32.7%                0.0%
> 
> So with mutex, the vanilla kernel and the one without optspin both run 
> faster.  This is consistent with what Peter reported.  With optspin, the 
> picture is more mixed, with lower throughput at low to moderate number 
> of threads and higher throughput with high number of threads.

> So, going back to your original table:

> % change in performance of the mmap with pthread-mutex vs pure mmap
> #threads    vanilla    all       without optspin
> 1           3.0%       -1.0%     -1.7%
> 5           7.2%       -26.8%    5.5%
> 10          5.2%       -10.6%    22.1%
> 20          6.8%       16.4%     12.5%
> 40          -0.2%      32.7%     0.0%
>
> In general, vanilla and no-optspin case perform better with 
> pthread-mutex.  For the case with optspin, mmap with pthread-mutex is 
> worse at low to moderate contention and better at high contention.

it appears that 'without optspin' appears to be a pretty good choice - if 
it wasn't for that '1 thread' number, which, if I correctly assume is the 
uncontended case, is one of the most common usecases ...

How can the single-threaded case get slower? None of the patches should 
really cause noticeable overhead in the non-contended case. That looks 
weird.

It would also be nice to see the 2, 3, 4 thread numbers - those are the 
most common contention scenarios in practice - where do we see the first 
improvement in performance?

> Also, it would be nice to include a noise/stddev figure, it's really hard 
to tell whether -1.7% is statistically significant.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-09 Thread Davidlohr Bueso
On Wed, 2013-10-09 at 20:14 -0700, Linus Torvalds wrote:
> On Wed, Oct 9, 2013 at 12:28 AM, Peter Zijlstra  wrote:
> >
> > The workload that I got the report from was a virus scanner, it would
> > spawn nr_cpus threads and {mmap file, scan content, munmap} through your
> > filesystem.
> 
> So I suspect we could make the mmap_sem write area *much* smaller for
> the normal cases.
> 
> Look at do_mmap_pgoff(), for example: it is run entirely under
> mmap_sem, but 99% of what it does doesn't actually need the lock.
> 
> The part that really needs the lock is
> 
> addr = get_unmapped_area(file, addr, len, pgoff, flags);
> addr = mmap_region(file, addr, len, vm_flags, pgoff);
> 
> but we hold it over all the other stuff too.
> 

True. By looking at the callers, we're always doing:

down_write(&mm->mmap_sem);
do_mmap_pgoff()
...
up_write(&mm->mmap_sem);

That goes for shm, aio, and of course mmap_pgoff().

While I know you hate two level locking, one way to go about this is to
take the lock inside do_mmap_pgoff() after the initial checks (flags,
page align, etc.) and return with the lock held, leaving the caller to
unlock it. 

> In fact, even if we moved the mmap_sem down into do_mmap(), and moved
> code around a bit to only hold it over those functions, it would still
> cover unnecessarily much. For example, while merging is common, not
> merging is pretty common too, and we do that
> 
> vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> 
> allocation under the lock. We could easily do things like preallocate
> it outside the lock.
> 

AFAICT there are also checks that should be done at the beginning of the
function, such as checking for MAP_LOCKED and VM_LOCKED flags before
calling get_unmapped_area().

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-09 Thread Linus Torvalds
On Wed, Oct 9, 2013 at 12:28 AM, Peter Zijlstra  wrote:
>
> The workload that I got the report from was a virus scanner, it would
> spawn nr_cpus threads and {mmap file, scan content, munmap} through your
> filesystem.

So I suspect we could make the mmap_sem write area *much* smaller for
the normal cases.

Look at do_mmap_pgoff(), for example: it is run entirely under
mmap_sem, but 99% of what it does doesn't actually need the lock.

The part that really needs the lock is

addr = get_unmapped_area(file, addr, len, pgoff, flags);
addr = mmap_region(file, addr, len, vm_flags, pgoff);

but we hold it over all the other stuff too.

In fact, even if we moved the mmap_sem down into do_mmap(), and moved
code around a bit to only hold it over those functions, it would still
cover unnecessarily much. For example, while merging is common, not
merging is pretty common too, and we do that

vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);

allocation under the lock. We could easily do things like preallocate
it outside the lock.
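
Schematically, something like the sketch below (rough illustration only,
not actual kernel code; mmap_region_prealloc() and vma_used are
hypothetical names for a helper that consumes the preallocated vma and a
flag saying whether it was needed):

	/* Preallocate the vma outside the lock to shrink the critical section. */
	vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);

	down_write(&mm->mmap_sem);
	addr = get_unmapped_area(file, addr, len, pgoff, flags);
	addr = mmap_region_prealloc(file, addr, len, vm_flags, pgoff, vma);
	up_write(&mm->mmap_sem);

	if (!vma_used)		/* e.g. the mapping got merged instead */
		kmem_cache_free(vm_area_cachep, vma);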

Right now mmap_sem covers pretty much the whole system call (we do do
some security checks outside of it).

I think the main issue is that nobody has ever cared deeply enough to
see how far this could be pushed. I suspect there is some low-hanging
fruit for anybody who is willing to handle the pain..

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-09 Thread Tim Chen
On Wed, 2013-10-09 at 08:15 +0200, Ingo Molnar wrote:
> * Tim Chen  wrote:
> 
> > Ingo,
> > 
> > I ran the vanilla kernel, the kernel with all rwsem patches and the 
> > kernel with all patches except the optimistic spin one.  I am listing 
> > two presentations of the data.  Please note that there is about 5% 
> > run-run variation.
> > 
> > % change in performance vs vanilla kernel
> > #threads    all       without optspin
> > mmap only
> > 1           1.9%      1.6%
> > 5           43.8%     2.6%
> > 10          22.7%     -3.0%
> > 20          -12.0%    -4.5%
> > 40          -26.9%    -2.0%
> > mmap with mutex acquisition
> > 1           -2.1%     -3.0%
> > 5           -1.9%     1.0%
> > 10          4.2%      12.5%
> > 20          -4.1%     0.6%
> > 40          -2.8%     -1.9%
> 
> Silly question: how do the two methods of starting N threads compare to 
> each other? 

They both start N pthreads and run for a fixed time.
The throughput of pure mmap with mutex vs pure mmap is below:

% change in performance of the mmap with pthread-mutex vs pure mmap
#threads    vanilla    all rwsem patches    without optspin
1           3.0%       -1.0%                -1.7%
5           7.2%       -26.8%               5.5%
10          5.2%       -10.6%               22.1%
20          6.8%       16.4%                12.5%
40          -0.2%      32.7%                0.0%

So with mutex, the vanilla kernel and the one without optspin both
run faster.  This is consistent with what Peter reported.  With
optspin, the picture is more mixed, with lower throughput at low to
moderate number of threads and higher throughput with high number
of threads.

> Do they have identical runtimes? 

Yes, they both have identical runtimes.  I look at the number 
of mmap and munmap operations I could push through.

> I think PeterZ's point was 
> that the pthread_mutex case, despite adding extra serialization, actually 
> runs faster in some circumstances.

Yes, I also see the pthread mutex run faster for the vanilla kernel
from the data above.

> 
> Also, mind posting the testcase? What 'work' do the threads do - clear 
> some memory area? 

The test case does a simple mmap and munmap of 1MB of memory per iteration.

> How big is the memory area?

1MB

The two cases are created as:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB";

void testcase(unsigned long long *iterations)
{
while (1) {
char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
assert(c != MAP_FAILED);
munmap(c, MEMSIZE);

(*iterations)++;
}
}

and adding mutex to serialize:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB with
mutex";

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void testcase(unsigned long long *iterations)
{
while (1) {
pthread_mutex_lock(&mutex);
char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
assert(c != MAP_FAILED);
munmap(c, MEMSIZE);
pthread_mutex_unlock(&mutex);

(*iterations)++;
}
}

and run as a pthread.
> 
> I'd expect this to be about large enough mmap()s showing page fault 
> processing to be mmap_sem bound and the serialization via pthread_mutex() 
> sets up a 'train' of threads in one case, while the parallel startup would 
> run into the mmap_sem in the regular case.
> 
> So I'd expect this to be a rather sensitive workload and you'd have to 
> actively engineer it to hit the effect PeterZ mentioned. I could imagine 
> MPI workloads to run into such patterns - but not deterministically.
> 
> Only once you've convinced yourself that you are hitting that kind of 
> effect reliably on the vanilla kernel, could/should the effects of an 
> improved rwsem implementation be measured.
> 
> Thanks,
> 
>   Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-09 Thread Peter Zijlstra
On Wed, Oct 09, 2013 at 08:15:51AM +0200, Ingo Molnar wrote:
> So I'd expect this to be a rather sensitive workload and you'd have to 
> actively engineer it to hit the effect PeterZ mentioned. I could imagine 
> MPI workloads to run into such patterns - but not deterministically.

The workload that I got the report from was a virus scanner, it would
spawn nr_cpus threads and {mmap file, scan content, munmap} through your
filesystem.

Now if I only could remember who reported this.. :/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-09 Thread Ingo Molnar

* Tim Chen  wrote:

> Ingo,
> 
> I ran the vanilla kernel, the kernel with all rwsem patches and the 
> kernel with all patches except the optimistic spin one.  I am listing 
> two presentations of the data.  Please note that there is about 5% 
> run-run variation.
> 
> % change in performance vs vanilla kernel
> #threads    all       without optspin
> mmap only
> 1           1.9%      1.6%
> 5           43.8%     2.6%
> 10          22.7%     -3.0%
> 20          -12.0%    -4.5%
> 40          -26.9%    -2.0%
> mmap with mutex acquisition
> 1           -2.1%     -3.0%
> 5           -1.9%     1.0%
> 10          4.2%      12.5%
> 20          -4.1%     0.6%
> 40          -2.8%     -1.9%

Silly question: how do the two methods of starting N threads compare to 
each other? Do they have identical runtimes? I think PeterZ's point was 
that the pthread_mutex case, despite adding extra serialization, actually 
runs faster in some circumstances.

Also, mind posting the testcase? What 'work' do the threads do - clear 
some memory area? How big is the memory area?

I'd expect this to be about large enough mmap()s showing page fault 
processing to be mmap_sem bound and the serialization via pthread_mutex() 
sets up a 'train' of threads in one case, while the parallel startup would 
run into the mmap_sem in the regular case.

So I'd expect this to be a rather sensitive workload and you'd have to 
actively engineer it to hit the effect PeterZ mentioned. I could imagine 
MPI workloads to run into such patterns - but not deterministically.

Only once you've convinced yourself that you are hitting that kind of 
effect reliably on the vanilla kernel, could/should the effects of an 
improved rwsem implementation be measured.

Thanks,

Ingo


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-09 Thread Linus Torvalds
On Wed, Oct 9, 2013 at 12:28 AM, Peter Zijlstra pet...@infradead.org wrote:

> The workload that I got the report from was a virus scanner, it would
> spawn nr_cpus threads and {mmap file, scan content, munmap} through your
> filesystem.

So I suspect we could make the mmap_sem write area *much* smaller for
the normal cases.

Look at do_mmap_pgoff(), for example: it is run entirely under
mmap_sem, but 99% of what it does doesn't actually need the lock.

The part that really needs the lock is

addr = get_unmapped_area(file, addr, len, pgoff, flags);
addr = mmap_region(file, addr, len, vm_flags, pgoff);

but we hold it over all the other stuff too.

In fact, even if we moved the mmap_sem down into do_mmap(), and moved
code around a bit to only hold it over those functions, it would still
cover unnecessarily much. For example, while merging is common, not
merging is pretty common too, and we do that

vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);

allocation under the lock. We could easily do things like preallocate
it outside the lock.

Right now mmap_sem covers pretty much the whole system call (we do do
some security checks outside of it).

I think the main issue is that nobody has ever cared deeply enough to
see how far this could be pushed. I suspect there is some low-hanging
fruit for anybody who is willing to handle the pain..

Linus
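
As an illustration only, the narrowed critical section being described might be
shaped roughly like the sketch below; the signature is simplified from the
3.12-era code and this is not an actual patch:

/*
 * Sketch only: hold mmap_sem just around the two calls that need it;
 * argument validation, flag computation and ideally the vm_area_struct
 * allocation would all happen before the lock is taken.
 */
static unsigned long narrowed_mmap_sketch(struct file *file, unsigned long addr,
                                          unsigned long len, unsigned long flags,
                                          vm_flags_t vm_flags, unsigned long pgoff)
{
    struct mm_struct *mm = current->mm;

    /* ... cheap checks and setup, no mmap_sem held ... */

    down_write(&mm->mmap_sem);
    addr = get_unmapped_area(file, addr, len, pgoff, flags);
    if (!(addr & ~PAGE_MASK))           /* not an error value */
        addr = mmap_region(file, addr, len, vm_flags, pgoff);
    up_write(&mm->mmap_sem);

    return addr;
}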


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-09 Thread Davidlohr Bueso
On Wed, 2013-10-09 at 20:14 -0700, Linus Torvalds wrote:
> On Wed, Oct 9, 2013 at 12:28 AM, Peter Zijlstra pet...@infradead.org wrote:
> 
> > The workload that I got the report from was a virus scanner, it would
> > spawn nr_cpus threads and {mmap file, scan content, munmap} through your
> > filesystem.
> 
> So I suspect we could make the mmap_sem write area *much* smaller for
> the normal cases.
> 
> Look at do_mmap_pgoff(), for example: it is run entirely under
> mmap_sem, but 99% of what it does doesn't actually need the lock.
> 
> The part that really needs the lock is
> 
> addr = get_unmapped_area(file, addr, len, pgoff, flags);
> addr = mmap_region(file, addr, len, vm_flags, pgoff);
> 
> but we hold it over all the other stuff too.
> 

True. By looking at the callers, we're always doing:

down_write(&mm->mmap_sem);
do_mmap_pgoff()
...
up_write(&mm->mmap_sem);

That goes for shm, aio, and of course mmap_pgoff().

While I know you hate two level locking, one way to go about this is to
take the lock inside do_mmap_pgoff() after the initial checks (flags,
page align, etc.) and return with the lock held, leaving the caller to
unlock it. 
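
A toy userspace analogue of that calling convention is sketched below; the
names and checks are made up and it only illustrates the "take the lock in the
callee, unlock in the caller" shape:

#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;

/* callee does its cheap checks first, then returns with the lock held */
static int setup_mapping_locked(size_t len)
{
    if (len == 0)               /* flag/alignment style checks, lock-free */
        return -1;

    pthread_mutex_lock(&map_lock);
    /* ... the work that actually needs exclusion ... */
    return 0;                   /* success: caller must unlock */
}

static void caller(void)
{
    if (setup_mapping_locked(4096) == 0) {
        /* ... anything else the caller does under the lock ... */
        pthread_mutex_unlock(&map_lock);
    }
}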

> In fact, even if we moved the mmap_sem down into do_mmap(), and moved
> code around a bit to only hold it over those functions, it would still
> cover unnecessarily much. For example, while merging is common, not
> merging is pretty common too, and we do that
> 
> vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> 
> allocation under the lock. We could easily do things like preallocate
> it outside the lock.
> 

AFAICT there are also checks that should be done at the beginning of the
function, such as checking for MAP_LOCKED and VM_LOCKED flags before
calling get_unmapped_area().

Thanks,
Davidlohr



Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-07 Thread Tim Chen
On Thu, 2013-10-03 at 09:32 +0200, Ingo Molnar wrote:
> * Tim Chen  wrote:
> 
> > For version 8 of the patchset, we included the patch from Waiman to 
> > streamline wakeup operations and also optimize the MCS lock used in 
> > rwsem and mutex.
> 
> I'd be feeling a lot easier about this patch series if you also had 
> performance figures that show how mmap_sem is affected.
> 
> These:
> 
> > Tim got the following improvement for exim mail server 
> > workload on 40 core system:
> > 
> > Alex+Tim's patchset:   +4.8%
> > Alex+Tim+Waiman's patchset: +5.3%
> 
> appear to be mostly related to the anon_vma->rwsem. But once that lock is 
> changed to an rwlock_t, this measurement falls away.
> 
> Peter Zijlstra suggested the following testcase:
> 
> ===>
> In fact, try something like this from userspace:
> 
> n-threads:
> 
>   pthread_mutex_lock(&mutex);
>   foo = mmap();
>   pthread_mutex_unlock(&mutex);
> 
>   /* work */
> 
>   pthread_mutex_lock(&mutex);
>   munmap(foo);
>   pthread_mutex_unlock(&mutex);
> 
> vs
> 
> n-threads:
> 
>   foo = mmap();
>   /* work */
>   munmap(foo);


Ingo,

I ran the vanilla kernel, the kernel with all rwsem patches and the
kernel with all patches except the optimistic spin one.  
I am listing two presentations of the data.  Please note that
there is about 5% run-run variation.

% change in performance vs vanilla kernel
#threads   all       without optspin
mmap only
1          1.9%      1.6%
5          43.8%     2.6%
10         22.7%     -3.0%
20         -12.0%    -4.5%
40         -26.9%    -2.0%
mmap with mutex acquisition
1          -2.1%     -3.0%
5          -1.9%     1.0%
10         4.2%      12.5%
20         -4.1%     0.6%
40         -2.8%     -1.9%

The optimistic spin case does very well at low to moderate contention,
but worse under very heavy contention for the pure mmap case.
For the case with the pthread mutex, there's not much change from the
vanilla kernel.

% change in performance of the mmap with pthread-mutex vs pure mmap
#threads   vanilla   all       without optspin
1          3.0%      -1.0%     -1.7%
5          7.2%      -26.8%    5.5%
10         5.2%      -10.6%    22.1%
20         6.8%      16.4%     12.5%
40         -0.2%     32.7%     0.0%

In general, the vanilla and no-optspin cases perform better with the
pthread mutex.  For the case with optspin, mmap with the
pthread mutex is worse at low to moderate contention and better
at high contention.

Tim

> 
> I've had reports that the former was significantly faster than the
> latter.
> <===
> 
> this could be put into a standalone testcase, or you could add it as a new 
> subcommand of 'perf bench', which already has some pthread code, see for 
> example in tools/perf/bench/sched-messaging.c. Adding:
> 
>perf bench mm threads
> 
> or so would be a natural thing to have.
> 
> Thanks,
> 
>   Ingo


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-03 Thread Ingo Molnar

* Tim Chen  wrote:

> For version 8 of the patchset, we included the patch from Waiman to 
> streamline wakeup operations and also optimize the MCS lock used in 
> rwsem and mutex.

I'd be feeling a lot easier about this patch series if you also had 
performance figures that show how mmap_sem is affected.

These:

> Tim got the following improvement for exim mail server 
> workload on 40 core system:
> 
> Alex+Tim's patchset: +4.8%
> Alex+Tim+Waiman's patchset: +5.3%

appear to be mostly related to the anon_vma->rwsem. But once that lock is 
changed to an rwlock_t, this measurement falls away.

Peter Zijlstra suggested the following testcase:

===>
In fact, try something like this from userspace:

n-threads:

  pthread_mutex_lock(&mutex);
  foo = mmap();
  pthread_mutex_unlock(&mutex);

  /* work */

  pthread_mutex_lock(&mutex);
  munmap(foo);
  pthread_mutex_unlock(&mutex);

vs

n-threads:

  foo = mmap();
  /* work */
  munmap(foo);

I've had reports that the former was significantly faster than the
latter.
<===

this could be put into a standalone testcase, or you could add it as a new 
subcommand of 'perf bench', which already has some pthread code, see for 
example in tools/perf/bench/sched-messaging.c. Adding:

   perf bench mm threads

or so would be a natural thing to have.

Thanks,

Ingo


[PATCH v8 0/9] rwsem performance optimizations

2013-10-02 Thread Tim Chen
For version 8 of the patchset, we included the patch from Waiman
to streamline wakeup operations and also optimize the MCS lock
used in rwsem and mutex.

In this patchset, we introduce three categories of optimizations to the
read-write semaphore.  The first four patches from Alex Shi reduce cache
bouncing of the sem->count field by doing a pre-read of the sem->count
and avoid cmpxchg if possible.
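
As an illustration of the idea (a sketch only, not Alex's actual patches), a
write trylock can bail out on a plain read before attempting the atomic
operation:

/* sketch of checking the lock before cmpxchg in a write trylock */
static inline int sketch_down_write_trylock(struct rw_semaphore *sem)
{
    /*
     * A plain read first: if the lock is busy, skip the cmpxchg and
     * avoid pulling the cache line in exclusive state for nothing.
     */
    if (sem->count != RWSEM_UNLOCKED_VALUE)
        return 0;

    return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
                   RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
}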

The next four patches from Tim, Davidlohr and Jason
introduce optimistic spinning logic similar to that in the
mutex code for the writer lock acquisition of rwsem. This addresses the
general 'mutexes outperform writer rwsems' situation that has been
seen in more than one case.  Users now need not worry about performance
issues when choosing between these two locking mechanisms.  We have
also factored out the MCS lock originally in the mutex code into its
own file, and performed micro optimizations and corrected the memory
barriers so it could be used for general lock/unlock of critical
sections.
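
For readers new to it, the MCS lock is roughly the queued spinlock sketched
below, where each waiter spins on its own node rather than on a shared word.
This is a simplified illustration, not the mcs_spinlock.h added by this series,
and it glosses over the exact acquire/release barriers the series corrects:

struct mcs_spinlock {
    struct mcs_spinlock *next;
    int locked;                 /* set to 1 when the lock is handed to us */
};

static inline void mcs_spin_lock(struct mcs_spinlock **lock,
                                 struct mcs_spinlock *node)
{
    struct mcs_spinlock *prev;

    node->locked = 0;
    node->next = NULL;

    prev = xchg(lock, node);    /* atomically join the tail of the queue */
    if (!prev)
        return;                 /* queue was empty: we own the lock */

    ACCESS_ONCE(prev->next) = node;
    while (!ACCESS_ONCE(node->locked))
        cpu_relax();            /* spin only on our own node's cache line */
}

static inline void mcs_spin_unlock(struct mcs_spinlock **lock,
                                   struct mcs_spinlock *node)
{
    struct mcs_spinlock *next = ACCESS_ONCE(node->next);

    if (!next) {
        /* no successor visible: try to release the lock outright */
        if (cmpxchg(lock, node, NULL) == node)
            return;
        /* a successor is between its xchg and setting prev->next */
        while (!(next = ACCESS_ONCE(node->next)))
            cpu_relax();
    }
    ACCESS_ONCE(next->locked) = 1;  /* hand the lock to the next waiter */
}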
 
The last patch from Waiman helps to streamline the wakeup operation
by avoiding multiple threads all doing wakeup operations when only
one wakeup thread is enough.  This significantly reduces lock
contention from multiple wakeup threads.

Tim got the following improvement for the exim mail server
workload on a 40-core system:

Alex+Tim's patchset:           +4.8%
Alex+Tim+Waiman's patchset:    +5.3%

Without these optimizations, Davidlohr Bueso saw a -8% regression to
aim7's shared and high_systime workloads when he switched i_mmap_mutex
to rwsem.  Tests were on an 8-socket, 80-core system.  With Alex
and Tim's patches, he got significant improvements to the aim7 
suite instead of regressions:

alltests (+16.3%), custom (+20%), disk (+19.5%), high_systime (+7%),
shared (+18.4%) and short (+6.3%).

More Aim7 numbers will be posted when Davidlohr has a chance
to test the complete patchset including Waiman's patch.

Thanks to Ingo Molnar, Peter Hurley, Peter Zijlstra and Paul McKenney
for helping to review this patchset.

Tim

Changelog:

v8:
1. Added Waiman's patch to avoid multiple wakeup thread lock contention.
2. Micro-optimizations of MCS lock.
3. Correct the barriers of MCS lock to prevent critical sections from
leaking.

v7:
1. Rename mcslock.h to mcs_spinlock.h and also rename mcs related fields
with mcs prefix.
2. Properly define type of *mcs_lock field instead of leaving it as *void.
3. Added brief explanation of MCS lock.

v6:
1. Fix missing mcslock.h file.
2. Fix various code style issues.

v5:
1. Try optimistic spinning before we put the writer on the wait queue
to avoid bottlenecking at the wait queue.  This provides a 5% boost to the exim workload
and between 2% to 8% boost to aim7.
2. Put MCS locking code into its own mcslock.h file for better reuse
between mutex.c and rwsem.c
3. Remove the configuration RWSEM_SPIN_ON_WRITE_OWNER and make the
operations default per Ingo's suggestions.

v4:
1. Fixed a bug in task_struct definition in rwsem_can_spin_on_owner
2. Fix another typo for RWSEM_SPIN_ON_WRITE_OWNER config option

v3:
1. Added ACCESS_ONCE to sem->count access in rwsem_can_spin_on_owner.
2. Fix typo bug for RWSEM_SPIN_ON_WRITE_OWNER option in init/Kconfig

v2:
1. Reorganize changes to down_write_trylock and do_wake into 4 patches and fixed
   a bug referencing &sem->count when sem->count is intended.
2. Fix unsafe sem->owner de-reference in rwsem_can_spin_on_owner.  Default
   the spinning option to on for more seasoning, but it can be turned off should
   it be detrimental.
3. Various patch comments update

Alex Shi (4):
  rwsem: check the lock before cpmxchg in down_write_trylock
  rwsem: remove 'out' label in do_wake
  rwsem: remove try_reader_grant label do_wake
  rwsem/wake: check lock before do atomic update

Jason Low (2):
  MCS Lock: optimizations and extra comments
  MCS Lock: Barrier corrections

Tim Chen (2):
  MCS Lock: Restructure the MCS lock defines and locking code into its
own file
  rwsem: do optimistic spinning for writer lock acquisition

Waiman Long (1):
  rwsem: reduce spinlock contention in wakeup code path

 include/asm-generic/rwsem.h  |8 +-
 include/linux/mcs_spinlock.h |   82 ++
 include/linux/mutex.h|5 +-
 include/linux/rwsem.h|9 ++-
 kernel/mutex.c   |   60 +-
 kernel/rwsem.c   |   19 +++-
 lib/rwsem.c  |  255 +-
 7 files changed, 349 insertions(+), 89 deletions(-)
 create mode 100644 include/linux/mcs_spinlock.h

-- 
1.7.4.4



