Re: [PATCH v8 0/9] rwsem performance optimizations
Ingo,

Sorry for the late response. My old 4-socket Westmere test machine went
down and I had to find a new one, a 4-socket Ivy Bridge machine with 15
cores per socket.

I've turned the workload into a perf benchmark (see the attached patch).
The workload mmaps, accesses memory in the mmaped area, and then unmaps,
repeatedly, for a specified time. Each thread is pinned to a particular
core, with the threads distributed evenly between the sockets. The
throughput is reported with standard deviation info.

First, a baseline comparing the workload with serialized mmap vs
non-serialized mmap, both running under the vanilla kernel:

Threads    serial vs non-serial mmap(%)    std dev(%)
 1            0.10                           0.16
 2            0.78                           0.09
 3           -5.00                           0.12
 4           -3.27                           0.08
 5           -0.11                           0.09
10            5.32                           0.10
20           -2.05                           0.05
40           -9.75                           0.15
60           11.69                           0.05

Here's the data for the complete rwsem patchset vs the plain vanilla
kernel. Overall there's improvement, except for the 3-thread case:

Threads    Throughput vs vanilla(%)    std dev(%)
 1            0.62                       0.11
 2            3.86                       0.10
 3           -7.02                       0.19
 4           -0.01                       0.13
 5            2.74                       0.06
10            5.66                       0.03
20            1.44                       0.09
40            5.54                       0.09
60           15.63                       0.13

Now testing with both the patched kernel and the vanilla kernel running
serialized mmap, with mutex acquisition in user space:

Threads    Throughput vs vanilla(%)    std dev(%)
 1            0.60                       0.02
 2            6.40                       0.11
 3           14.13                       0.07
 4           -2.41                       0.07
 5            1.05                       0.08
10            4.15                       0.05
20           -0.26                       0.06
40           -3.45                       0.13
60           -4.33                       0.07

Here's another run with the rwsem patchset, but without optimistic
spinning:

Threads    Throughput vs vanilla(%)    std dev(%)
 1            0.81                       0.04
 2            2.85                       0.17
 3           -4.09                       0.05
 4           -8.31                       0.07
 5           -3.19                       0.03
10            1.02                       0.05
20           -4.77                       0.04
40           -3.11                       0.10
60            2.06                       0.10

No-optspin again, this time comparing the serialized-mmap workload under
the patched kernel vs the vanilla kernel:

Threads    Throughput vs vanilla(%)    std dev(%)
 1            0.57                       0.03
 2            2.13                       0.17
 3           14.78                       0.33
 4           -1.23                       0.11
 5            2.99                       0.08
10           -0.43                       0.10
20            0.01                       0.03
40            3.03                       0.10
60           -1.74                       0.09

The data is a bit of a mixed bag. I'll spin off the MCS cleanup patch
separately so we can merge that first for Waiman's qrwlock work.

Tim

---
From 6c5916315c1515fb2281d9344b2c4f371ca99879 Mon Sep 17 00:00:00 2001
From: Tim Chen
Date: Wed, 30 Oct 2013 05:18:29 -0700
Subject: [PATCH] perf mmap and memory write test

This patch adds a perf benchmark that mmaps a piece of memory, writes
to the memory and unmaps the memory, for a given number of threads.
The threads are distributed and pinned evenly across the sockets on
the machine. The options for the benchmark are as follows:

usage: perf bench mem mmap <options>

    -l, --length <1MB>      Specify length of memory to set.
                            Available units: B, KB, MB, GB and TB
                            (upper and lower)
    -i, --iterations <n>    repeat mmap() invocation this number of times
    -n, --threads <n>       number of threads doing mmap() invocation
    -r, --runtime <n>       runtime per iteration in sec
    -w, --warmup <n>        warmup time in sec
    -s, --serialize         serialize the mmap() operations with mutex
    -v, --verbose           verbose output giving info about each iteration

Signed-off-by: Tim Chen
---
 tools/perf/Makefile         |   1 +
 tools/perf/bench/bench.h    |   1 +
 tools/perf/bench/mem-mmap.c | 312 ++++++++++++++++++++++++++++++++++++
 tools/perf/builtin-bench.c  |   3 +
 4 files changed, 317 insertions(+)
 create mode 100644 tools/perf/bench/mem-mmap.c

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 64c043b..80e32d1 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -408,6 +408,7 @@ BUILTIN_OBJS += $(OUTPUT)bench/mem-memset-x86-64-asm.o
 endif
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memset.o
+BUILTIN_OBJS += $(OUTPUT)bench/mem-mmap.o
 BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
 BUILTIN_OBJS += $(OUTPUT)builtin-evlist.o
diff --git
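For reference, the core of each benchmark thread boils down to the loop
sketched below. This is a simplified illustration, not the actual
mem-mmap.c code: the struct and helper names are invented here, with
per-core pinning done via pthread_setaffinity_np() and --serialize
modeled as a global pthread mutex.

/*
 * Simplified sketch of the per-thread benchmark loop described above;
 * illustrative only, not the actual mem-mmap.c implementation. cpu,
 * len and serialize stand in for the benchmark's option values.
 */
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>

static pthread_mutex_t mmap_mutex = PTHREAD_MUTEX_INITIALIZER;

struct worker_args {
        int cpu;                        /* core this thread is pinned to */
        size_t len;                     /* --length */
        int serialize;                  /* --serialize */
        volatile int *stop;             /* set when --runtime expires */
        unsigned long long iterations;  /* throughput counter */
};

static void *worker(void *arg)
{
        struct worker_args *a = arg;
        cpu_set_t set;
        size_t i;

        /* pin this thread to its assigned core */
        CPU_ZERO(&set);
        CPU_SET(a->cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        while (!*a->stop) {
                char *c;

                if (a->serialize)
                        pthread_mutex_lock(&mmap_mutex);

                /* mmap, touch the mapping at a fixed stride, munmap */
                c = mmap(NULL, a->len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                assert(c != MAP_FAILED);
                for (i = 0; i < a->len; i += 8)
                        c[i] = 0xa;
                munmap(c, a->len);

                if (a->serialize)
                        pthread_mutex_unlock(&mmap_mutex);
                a->iterations++;
        }
        return NULL;
}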
Re: [PATCH v8 0/9] rwsem performance optimizations
* Tim Chen wrote:

> > It would be _really_ nice to stick this into tools/perf/bench/ as:
> >
> >     perf bench mem pagefaults
> >
> > or so, with a number of parallelism and workload patterns. See
> > tools/perf/bench/numa.c for a couple of workload generators - although
> > those are not page fault intense.
> >
> > So that future generations can run all these tests too and such.
> >
> > > I compare the throughput where I have the complete rwsem patchset
> > > against vanilla and the case where I take out the optimistic spin
> > > patch. I have increased the run time by 10x from my previous
> > > experiments and do 10 runs for each case. The standard deviation is
> > > ~1.5%, so any change above 1.5% is statistically significant.
> > >
> > > % change in throughput vs the vanilla kernel.
> > > Threads    all        No-optspin
> > > 1          +0.4%      -0.1%
> > > 2          +2.0%      +0.2%
> > > 3          +1.1%      +1.5%
> > > 4          -0.5%      -1.4%
> > > 5          -0.1%      -0.1%
> > > 10         +2.2%      -1.2%
> > > 20         +237.3%    -2.3%
> > > 40         +548.1%    +0.3%
> >
> > The tail is impressive. The early parts are important as well, but it's
> > really hard to tell the significance of the early portion without
> > having a stddev column.
> >
> > ( "perf stat --repeat N" will give you stddev output, in handy
> >   percentage form. )
>
> Quick naive question, as I haven't hacked perf bench before.

Btw., please use tip:master, I've got a few cleanups in there that should
make it easier to hack.

> Now perf stat gives the statistics of the performance counters or
> events. How do I get it to compute the stats of the throughput reported
> by perf bench?

What I do is that I measure the execution time, via:

    perf stat --null --repeat 10 perf bench ...

instead of relying on benchmark output.

> Something like
>
>     perf stat -r 10 -- perf bench mm memset --iterations 10
>
> doesn't quite give what I need.

Yeah. So, perf bench also has a 'simple' output format:

    comet:~/tip> perf bench -f simple sched pipe
    10.378

We could extend 'perf stat' with an option to not measure time, but to
take any numeric data output from the executed task and use that as the
measurement result. If you'd be interested in such a feature I can give
it a try.

Thanks,

	Ingo
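What that mode could look like, sketched as a standalone C program
rather than actual perf code (the repeat count and command string are
just example inputs): run the task N times, parse the single number
that 'perf bench -f simple ...' prints, and report the mean and stddev
of the samples.

/*
 * Standalone sketch of the proposed mode, not actual perf code: repeat
 * a command N times, treat its numeric stdout (e.g. from
 * "perf bench -f simple sched pipe") as the measurement, and print the
 * mean and (population) standard deviation of the samples.
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        int i, n;
        double x, sum = 0.0, sq = 0.0;

        if (argc < 3) {
                fprintf(stderr, "usage: %s <repeats> <command>\n", argv[0]);
                return 1;
        }
        n = atoi(argv[1]);

        for (i = 0; i < n; i++) {
                FILE *p = popen(argv[2], "r");  /* run one iteration */

                if (!p || fscanf(p, "%lf", &x) != 1)
                        return 1;
                pclose(p);
                sum += x;
                sq += x * x;
        }
        printf("mean %.3f stddev %.3f\n", sum / n,
               sqrt(sq / n - (sum / n) * (sum / n)));
        return 0;
}

Invoked as, e.g., ./repeat 10 'perf bench -f simple sched pipe' (link
with -lm).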
Re: [PATCH v8 0/9] rwsem performance optimizations
> It would be _really_ nice to stick this into tools/perf/bench/ as:
>
>     perf bench mem pagefaults
>
> or so, with a number of parallelism and workload patterns. See
> tools/perf/bench/numa.c for a couple of workload generators - although
> those are not page fault intense.
>
> So that future generations can run all these tests too and such.
>
> > I compare the throughput where I have the complete rwsem patchset
> > against vanilla and the case where I take out the optimistic spin
> > patch. I have increased the run time by 10x from my previous
> > experiments and do 10 runs for each case. The standard deviation is
> > ~1.5%, so any change above 1.5% is statistically significant.
> >
> > % change in throughput vs the vanilla kernel.
> > Threads    all        No-optspin
> > 1          +0.4%      -0.1%
> > 2          +2.0%      +0.2%
> > 3          +1.1%      +1.5%
> > 4          -0.5%      -1.4%
> > 5          -0.1%      -0.1%
> > 10         +2.2%      -1.2%
> > 20         +237.3%    -2.3%
> > 40         +548.1%    +0.3%
>
> The tail is impressive. The early parts are important as well, but it's
> really hard to tell the significance of the early portion without
> having a stddev column.
>
> ( "perf stat --repeat N" will give you stddev output, in handy
>   percentage form. )

Quick naive question, as I haven't hacked perf bench before.

Now perf stat gives the statistics of the performance counters or
events. How do I get it to compute the stats of the throughput reported
by perf bench?

Something like

    perf stat -r 10 -- perf bench mm memset --iterations 10

doesn't quite give what I need. Pointers appreciated.

Tim
Re: [PATCH v8 0/9] rwsem performance optimizations
On Wed, 2013-10-16 at 08:55 +0200, Ingo Molnar wrote:
> * Tim Chen wrote:
>
> > On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> > > * Tim Chen wrote:
> > >
> > > > The throughput of pure mmap with mutex vs pure mmap is below:
> > > >
> > > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > > #threads   vanilla   all rwsem patches   without optspin
> > > > 1            3.0%        -1.0%              -1.7%
> > > > 5            7.2%       -26.8%               5.5%
> > > > 10           5.2%       -10.6%              22.1%
> > > > 20           6.8%        16.4%              12.5%
> > > > 40          -0.2%        32.7%               0.0%
> > > >
> > > > So with mutex, the vanilla kernel and the one without optspin both
> > > > run faster. This is consistent with what Peter reported. With
> > > > optspin, the picture is more mixed, with lower throughput at low
> > > > to moderate numbers of threads and higher throughput with a high
> > > > number of threads.
> > >
> > > So, going back to your original table:
> > >
> > > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > > #threads   vanilla   all    without optspin
> > > > 1            3.0%    -1.0%      -1.7%
> > > > 5            7.2%   -26.8%       5.5%
> > > > 10           5.2%   -10.6%      22.1%
> > > > 20           6.8%    16.4%      12.5%
> > > > 40          -0.2%    32.7%       0.0%
> > > >
> > > > In general, vanilla and the no-optspin case perform better with
> > > > pthread-mutex. For the case with optspin, mmap with pthread-mutex
> > > > is worse at low to moderate contention and better at high
> > > > contention.
> > >
> > > it appears that 'without optspin' is a pretty good choice - if it
> > > wasn't for that '1 thread' number, which, if I correctly assume is
> > > the uncontended case, is one of the most common usecases ...
> > >
> > > How can the single-threaded case get slower? None of the patches
> > > should really cause noticeable overhead in the non-contended case.
> > > That looks weird.
> > >
> > > It would also be nice to see the 2, 3, 4 thread numbers - those are
> > > the most common contention scenarios in practice - where do we see
> > > the first improvement in performance?
> > >
> > > Also, it would be nice to include a noise/stddev figure, it's really
> > > hard to tell whether -1.7% is statistically significant.
> >
> > Ingo,
> >
> > I think that the optimistic spin changes to rwsem should enhance
> > performance of real workloads after all.
> >
> > In my previous tests, I was doing mmap followed immediately by munmap
> > without doing anything to the memory. No real workload will behave
> > that way and it is not the scenario that we should optimize for. A
> > much better approximation of real usage will be doing mmap, then
> > touching the memory being mmaped, followed by munmap.
>
> That's why I asked for a working testcase to be posted ;-) Not just
> pseudocode - send the real .c thing please.

I was using a modified version of Anton's will-it-scale test. I'll try
to port the tests to perf bench to make it easier for other people to
run them.

> > This changes the dynamics of the rwsem as we are now dominated by read
> > acquisitions of mmap sem due to the page faults, instead of having only
> > write acquisitions from mmap. [...]
>
> Absolutely, the page fault read case is the #1 optimization target of
> rwsems.
>
> > [...] In this case, any delay in write acquisitions will be costly as
> > we will be blocking a lot of readers. This is where optimistic spinning
> > on write acquisitions of mmap sem can provide a very significant boost
> > to the throughput.
> >
> > I changed the test case to the following, with writes to the mmaped
> > memory:
> >
> > #define MEMSIZE (1 * 1024 * 1024)
> >
> > char *testcase_description = "Anonymous memory mmap/munmap of 1MB";
> >
> > void testcase(unsigned long long *iterations)
> > {
> >         int i;
> >
> >         while (1) {
> >                 char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
> >                                MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> >                 assert(c != MAP_FAILED);
> >                 for (i = 0; i < MEMSIZE; i += 8)
> >                         c[i] = 0xa;
> >                 munmap(c, MEMSIZE);
> >
> >                 (*iterations)++;
> >         }
> > }
>
> It would be _really_ nice to stick this into tools/perf/bench/ as:
>
>     perf bench mem pagefaults
>
> or so, with a number of parallelism and workload patterns. See
> tools/perf/bench/numa.c for a couple of workload generators - although
> those are not page fault intense.
>
> So that future generations can run all these tests too and such.

Okay, will do.

> > I compare the throughput where I have the complete rwsem patchset
> > against vanilla and the case where I take out the optimistic spin
> > patch. I have increased the run time by 10x from my previous
> > experiments and do 10 runs for each case. The standard deviation is
> > ~1.5%, so any change above 1.5% is statistically significant.
Re: [PATCH v8 0/9] rwsem performance optimizations
* Tim Chen wrote:

> On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> > * Tim Chen wrote:
> >
> > > The throughput of pure mmap with mutex vs pure mmap is below:
> > >
> > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > #threads   vanilla   all rwsem patches   without optspin
> > > 1            3.0%        -1.0%              -1.7%
> > > 5            7.2%       -26.8%               5.5%
> > > 10           5.2%       -10.6%              22.1%
> > > 20           6.8%        16.4%              12.5%
> > > 40          -0.2%        32.7%               0.0%
> > >
> > > So with mutex, the vanilla kernel and the one without optspin both
> > > run faster. This is consistent with what Peter reported. With
> > > optspin, the picture is more mixed, with lower throughput at low to
> > > moderate numbers of threads and higher throughput with a high number
> > > of threads.
> >
> > So, going back to your original table:
> >
> > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > #threads   vanilla   all    without optspin
> > > 1            3.0%    -1.0%      -1.7%
> > > 5            7.2%   -26.8%       5.5%
> > > 10           5.2%   -10.6%      22.1%
> > > 20           6.8%    16.4%      12.5%
> > > 40          -0.2%    32.7%       0.0%
> > >
> > > In general, vanilla and the no-optspin case perform better with
> > > pthread-mutex. For the case with optspin, mmap with pthread-mutex is
> > > worse at low to moderate contention and better at high contention.
> >
> > it appears that 'without optspin' is a pretty good choice - if it
> > wasn't for that '1 thread' number, which, if I correctly assume is the
> > uncontended case, is one of the most common usecases ...
> >
> > How can the single-threaded case get slower? None of the patches
> > should really cause noticeable overhead in the non-contended case.
> > That looks weird.
> >
> > It would also be nice to see the 2, 3, 4 thread numbers - those are
> > the most common contention scenarios in practice - where do we see the
> > first improvement in performance?
> >
> > Also, it would be nice to include a noise/stddev figure, it's really
> > hard to tell whether -1.7% is statistically significant.
>
> Ingo,
>
> I think that the optimistic spin changes to rwsem should enhance
> performance of real workloads after all.
>
> In my previous tests, I was doing mmap followed immediately by munmap
> without doing anything to the memory. No real workload will behave that
> way and it is not the scenario that we should optimize for. A much
> better approximation of real usage will be doing mmap, then touching
> the memory being mmaped, followed by munmap.

That's why I asked for a working testcase to be posted ;-) Not just
pseudocode - send the real .c thing please.

> This changes the dynamics of the rwsem as we are now dominated by read
> acquisitions of mmap sem due to the page faults, instead of having only
> write acquisitions from mmap. [...]

Absolutely, the page fault read case is the #1 optimization target of
rwsems.

> [...] In this case, any delay in write acquisitions will be costly as we
> will be blocking a lot of readers. This is where optimistic spinning on
> write acquisitions of mmap sem can provide a very significant boost to
> the throughput.
>
> I changed the test case to the following, with writes to the mmaped
> memory:
>
> #define MEMSIZE (1 * 1024 * 1024)
>
> char *testcase_description = "Anonymous memory mmap/munmap of 1MB";
>
> void testcase(unsigned long long *iterations)
> {
>         int i;
>
>         while (1) {
>                 char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
>                                MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
>                 assert(c != MAP_FAILED);
>                 for (i = 0; i < MEMSIZE; i += 8)
>                         c[i] = 0xa;
>                 munmap(c, MEMSIZE);
>
>                 (*iterations)++;
>         }
> }

It would be _really_ nice to stick this into tools/perf/bench/ as:

    perf bench mem pagefaults

or so, with a number of parallelism and workload patterns. See
tools/perf/bench/numa.c for a couple of workload generators - although
those are not page fault intense.

So that future generations can run all these tests too and such.

> I compare the throughput where I have the complete rwsem patchset
> against vanilla and the case where I take out the optimistic spin
> patch. I have increased the run time by 10x from my previous
> experiments and do 10 runs for each case. The standard deviation is
> ~1.5%, so any change above 1.5% is statistically significant.
>
> % change in throughput vs the vanilla kernel.
> Threads    all        No-optspin
> 1          +0.4%      -0.1%
> 2          +2.0%      +0.2%
> 3          +1.1%      +1.5%
> 4          -0.5%      -1.4%
> 5          -0.1%      -0.1%
> 10         +2.2%      -1.2%
> 20         +237.3%    -2.3%
> 40         +548.1%    +0.3%

The tail is impressive. The early parts are important as well, but it's
really hard to tell the significance of the early portion without having
a stddev column.

( "perf stat --repeat N" will give you stddev output, in handy
  percentage form. )
Re: [PATCH v8 0/9] rwsem performance optimizations
On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> * Tim Chen wrote:
>
> > The throughput of pure mmap with mutex vs pure mmap is below:
> >
> > % change in performance of the mmap with pthread-mutex vs pure mmap
> > #threads   vanilla   all rwsem patches   without optspin
> > 1            3.0%        -1.0%              -1.7%
> > 5            7.2%       -26.8%               5.5%
> > 10           5.2%       -10.6%              22.1%
> > 20           6.8%        16.4%              12.5%
> > 40          -0.2%        32.7%               0.0%
> >
> > So with mutex, the vanilla kernel and the one without optspin both run
> > faster. This is consistent with what Peter reported. With optspin, the
> > picture is more mixed, with lower throughput at low to moderate
> > numbers of threads and higher throughput with a high number of
> > threads.
>
> So, going back to your original table:
>
> > % change in performance of the mmap with pthread-mutex vs pure mmap
> > #threads   vanilla   all    without optspin
> > 1            3.0%    -1.0%      -1.7%
> > 5            7.2%   -26.8%       5.5%
> > 10           5.2%   -10.6%      22.1%
> > 20           6.8%    16.4%      12.5%
> > 40          -0.2%    32.7%       0.0%
> >
> > In general, vanilla and the no-optspin case perform better with
> > pthread-mutex. For the case with optspin, mmap with pthread-mutex is
> > worse at low to moderate contention and better at high contention.
>
> it appears that 'without optspin' is a pretty good choice - if it
> wasn't for that '1 thread' number, which, if I correctly assume is the
> uncontended case, is one of the most common usecases ...
>
> How can the single-threaded case get slower? None of the patches should
> really cause noticeable overhead in the non-contended case. That looks
> weird.
>
> It would also be nice to see the 2, 3, 4 thread numbers - those are the
> most common contention scenarios in practice - where do we see the
> first improvement in performance?
>
> Also, it would be nice to include a noise/stddev figure, it's really
> hard to tell whether -1.7% is statistically significant.

Ingo,

I think that the optimistic spin changes to rwsem should enhance
performance of real workloads after all.

In my previous tests, I was doing mmap followed immediately by munmap
without doing anything to the memory. No real workload will behave that
way and it is not the scenario that we should optimize for. A much
better approximation of real usage will be doing mmap, then touching
the memory being mmaped, followed by munmap.

This changes the dynamics of the rwsem as we are now dominated by read
acquisitions of mmap sem due to the page faults, instead of having only
write acquisitions from mmap. In this case, any delay in write
acquisitions will be costly as we will be blocking a lot of readers.
This is where optimistic spinning on write acquisitions of mmap sem can
provide a very significant boost to the throughput.

I changed the test case to the following, with writes to the mmaped
memory:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB";

void testcase(unsigned long long *iterations)
{
        int i;

        while (1) {
                char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                assert(c != MAP_FAILED);
                for (i = 0; i < MEMSIZE; i += 8)
                        c[i] = 0xa;
                munmap(c, MEMSIZE);

                (*iterations)++;
        }
}

I compare the throughput where I have the complete rwsem patchset
against vanilla and the case where I take out the optimistic spin patch.
I have increased the run time by 10x from my previous experiments and do
10 runs for each case. The standard deviation is ~1.5%, so any change
above 1.5% is statistically significant.

% change in throughput vs the vanilla kernel.
Threads    all        No-optspin
1          +0.4%      -0.1%
2          +2.0%      +0.2%
3          +1.1%      +1.5%
4          -0.5%      -1.4%
5          -0.1%      -0.1%
10         +2.2%      -1.2%
20         +237.3%    -2.3%
40         +548.1%    +0.3%

For threads 1 to 5, we have essentially the same performance as the
vanilla case. We get a boost in throughput of 237% for 20 threads and
548% for 40 threads. When we take out the optimistic spin, throughput is
mostly similar to the vanilla kernel for this test.

When I look at the profile of the vanilla kernel for the 40-thread case,
80% of cpu time is spent contending for the spin lock of the rwsem wait
queue in rwsem_down_read_failed during page faults. When I apply the
rwsem patchset with optimistic spin, this lock contention goes down to
only 2% of cpu time.

Now when I test the case where we acquire the mutex in user space before
mmap, I get the following data versus the vanilla kernel. There's little
contention on mmap sem acquisition in this case.
Re: [PATCH v8 0/9] rwsem performance optimizations
* Tim Chen wrote:

> The throughput of pure mmap with mutex vs pure mmap is below:
>
> % change in performance of the mmap with pthread-mutex vs pure mmap
> #threads   vanilla   all rwsem patches   without optspin
> 1            3.0%        -1.0%              -1.7%
> 5            7.2%       -26.8%               5.5%
> 10           5.2%       -10.6%              22.1%
> 20           6.8%        16.4%              12.5%
> 40          -0.2%        32.7%               0.0%
>
> So with mutex, the vanilla kernel and the one without optspin both run
> faster. This is consistent with what Peter reported. With optspin, the
> picture is more mixed, with lower throughput at low to moderate numbers
> of threads and higher throughput with a high number of threads.

So, going back to your original table:

> % change in performance of the mmap with pthread-mutex vs pure mmap
> #threads   vanilla   all    without optspin
> 1            3.0%    -1.0%      -1.7%
> 5            7.2%   -26.8%       5.5%
> 10           5.2%   -10.6%      22.1%
> 20           6.8%    16.4%      12.5%
> 40          -0.2%    32.7%       0.0%
>
> In general, vanilla and the no-optspin case perform better with
> pthread-mutex. For the case with optspin, mmap with pthread-mutex is
> worse at low to moderate contention and better at high contention.

it appears that 'without optspin' is a pretty good choice - if it wasn't
for that '1 thread' number, which, if I correctly assume is the
uncontended case, is one of the most common usecases ...

How can the single-threaded case get slower? None of the patches should
really cause noticeable overhead in the non-contended case. That looks
weird.

It would also be nice to see the 2, 3, 4 thread numbers - those are the
most common contention scenarios in practice - where do we see the first
improvement in performance?

Also, it would be nice to include a noise/stddev figure, it's really
hard to tell whether -1.7% is statistically significant.

Thanks,

	Ingo
Re: [PATCH v8 0/9] rwsem performance optimizations
On Wed, 2013-10-09 at 20:14 -0700, Linus Torvalds wrote:
> On Wed, Oct 9, 2013 at 12:28 AM, Peter Zijlstra wrote:
> >
> > The workload that I got the report from was a virus scanner, it would
> > spawn nr_cpus threads and {mmap file, scan content, munmap} through
> > your filesystem.
>
> So I suspect we could make the mmap_sem write area *much* smaller for
> the normal cases.
>
> Look at do_mmap_pgoff(), for example: it is run entirely under
> mmap_sem, but 99% of what it does doesn't actually need the lock.
>
> The part that really needs the lock is
>
>     addr = get_unmapped_area(file, addr, len, pgoff, flags);
>     addr = mmap_region(file, addr, len, vm_flags, pgoff);
>
> but we hold it over all the other stuff too.

True. Looking at the callers, we're always doing:

    down_write(&mm->mmap_sem);
    do_mmap_pgoff()
    ...
    up_write(&mm->mmap_sem);

That goes for shm, aio, and of course mmap_pgoff(). While I know you
hate two-level locking, one way to go about this is to take the lock
inside do_mmap_pgoff() after the initial checks (flags, page alignment,
etc.) and return with the lock held, leaving the caller to unlock it.

> In fact, even if we moved the mmap_sem down into do_mmap(), and moved
> code around a bit to only hold it over those functions, it would still
> cover unnecessarily much. For example, while merging is common, not
> merging is pretty common too, and we do that
>
>     vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
>
> allocation under the lock. We could easily do things like preallocate
> it outside the lock.

AFAICT there are also checks that should be done at the beginning of the
function, such as checking for the MAP_LOCKED and VM_LOCKED flags before
calling get_unmapped_area().

Thanks,
Davidlohr
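The calling convention being suggested, illustrated as a self-contained
userspace analogue with a pthread rwlock standing in for mmap_sem;
purely a sketch of the locking pattern, not kernel code, and the
function names here are made up:

/*
 * Userspace analogue of the suggested convention, a sketch of the
 * locking pattern only: the callee does its cheap validation unlocked,
 * takes the write lock itself, and returns with it held; the caller
 * is responsible for unlocking.
 */
#include <errno.h>
#include <pthread.h>

static pthread_rwlock_t map_sem = PTHREAD_RWLOCK_INITIALIZER;

/* Returns 0 with map_sem write-held on success; -EINVAL, unlocked, on error. */
static int do_map_locked(unsigned long len)
{
        if (!len || (len & 4095))       /* cheap flag/alignment checks: no lock */
                return -EINVAL;

        pthread_rwlock_wrlock(&map_sem); /* lock taken here, not by the caller */
        /* ... the get_unmapped_area()/mmap_region() equivalents go here ... */
        return 0;
}

int main(void)
{
        if (do_map_locked(4096) == 0) {
                /* caller finishes up and unlocks */
                pthread_rwlock_unlock(&map_sem);
        }
        return 0;
}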
Re: [PATCH v8 0/9] rwsem performance optimizations
On Wed, Oct 9, 2013 at 12:28 AM, Peter Zijlstra wrote:
>
> The workload that I got the report from was a virus scanner, it would
> spawn nr_cpus threads and {mmap file, scan content, munmap} through your
> filesystem.

So I suspect we could make the mmap_sem write area *much* smaller for
the normal cases.

Look at do_mmap_pgoff(), for example: it is run entirely under
mmap_sem, but 99% of what it does doesn't actually need the lock.

The part that really needs the lock is

    addr = get_unmapped_area(file, addr, len, pgoff, flags);
    addr = mmap_region(file, addr, len, vm_flags, pgoff);

but we hold it over all the other stuff too.

In fact, even if we moved the mmap_sem down into do_mmap(), and moved
code around a bit to only hold it over those functions, it would still
cover unnecessarily much. For example, while merging is common, not
merging is pretty common too, and we do that

    vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);

allocation under the lock. We could easily do things like preallocate
it outside the lock.

Right now mmap_sem covers pretty much the whole system call (we do do
some security checks outside of it).

I think the main issue is that nobody has ever cared deeply enough to
see how far this could be pushed. I suspect there is some low-hanging
fruit for anybody who is willing to handle the pain..

	Linus
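And the preallocation idea, again as a userspace analogue with made-up
names, a sketch of the pattern rather than a proposed kernel patch:
allocate the new node before taking the lock, and free it on the
(common) merge path where it turns out to be unneeded.

/*
 * Sketch of the preallocation pattern described above: allocate
 * outside the lock, like the kmem_cache_zalloc() call today could be,
 * and discard the allocation when the region merges instead.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct vma { unsigned long start, end; struct vma *next; };

static pthread_rwlock_t map_sem = PTHREAD_RWLOCK_INITIALIZER;
static struct vma *vma_list;

static bool try_merge(unsigned long start, unsigned long end)
{
        struct vma *v;

        for (v = vma_list; v; v = v->next)
                if (v->end == start) {  /* extend an adjacent region */
                        v->end = end;
                        return true;
                }
        return false;
}

static int map_region(unsigned long start, unsigned long end)
{
        /* preallocate outside the lock */
        struct vma *new = calloc(1, sizeof(*new));

        if (!new)
                return -1;

        pthread_rwlock_wrlock(&map_sem);
        if (try_merge(start, end)) {
                pthread_rwlock_unlock(&map_sem);
                free(new);              /* merged: preallocation unused */
                return 0;
        }
        new->start = start;
        new->end = end;
        new->next = vma_list;
        vma_list = new;
        pthread_rwlock_unlock(&map_sem);
        return 0;
}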
Re: [PATCH v8 0/9] rwsem performance optimizations
On Wed, 2013-10-09 at 08:15 +0200, Ingo Molnar wrote:
> * Tim Chen wrote:
>
> > Ingo,
> >
> > I ran the vanilla kernel, the kernel with all rwsem patches and the
> > kernel with all patches except the optimistic spin one. I am listing
> > two presentations of the data. Please note that there is about 5%
> > run-run variation.
> >
> > % change in performance vs vanilla kernel
> > #threads   all       without optspin
> > mmap only
> > 1            1.9%       1.6%
> > 5           43.8%       2.6%
> > 10          22.7%      -3.0%
> > 20         -12.0%      -4.5%
> > 40         -26.9%      -2.0%
> > mmap with mutex acquisition
> > 1           -2.1%      -3.0%
> > 5           -1.9%       1.0%
> > 10           4.2%      12.5%
> > 20          -4.1%       0.6%
> > 40          -2.8%      -1.9%
>
> Silly question: how do the two methods of starting N threads compare to
> each other?

They both start N pthreads and run for a fixed time. The throughput of
pure mmap with mutex vs pure mmap is below:

% change in performance of the mmap with pthread-mutex vs pure mmap
#threads   vanilla   all rwsem patches   without optspin
1            3.0%        -1.0%              -1.7%
5            7.2%       -26.8%               5.5%
10           5.2%       -10.6%              22.1%
20           6.8%        16.4%              12.5%
40          -0.2%        32.7%               0.0%

So with mutex, the vanilla kernel and the one without optspin both run
faster. This is consistent with what Peter reported. With optspin, the
picture is more mixed, with lower throughput at low to moderate numbers
of threads and higher throughput with a high number of threads.

> Do they have identical runtimes?

Yes, they both have identical runtimes. I look at the number of mmap and
munmap operations I can push through.

> I think PeterZ's point was that the pthread_mutex case, despite adding
> extra serialization, actually runs faster in some circumstances.

Yes, I also see the pthread-mutex case run faster for the vanilla kernel
from the data above.

> Also, mind posting the testcase? What 'work' do the threads do - clear
> some memory area?

The test case does a simple mmap and munmap of 1MB of memory per
iteration.

> How big is the memory area?

1MB.

The two cases are created as:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB";

void testcase(unsigned long long *iterations)
{
        while (1) {
                char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                assert(c != MAP_FAILED);
                munmap(c, MEMSIZE);

                (*iterations)++;
        }
}

and, adding a mutex to serialize:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB with mutex";

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void testcase(unsigned long long *iterations)
{
        while (1) {
                pthread_mutex_lock(&mutex);
                char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                assert(c != MAP_FAILED);
                munmap(c, MEMSIZE);
                pthread_mutex_unlock(&mutex);

                (*iterations)++;
        }
}

each run as a pthread.

> I'd expect this to be about large enough mmap()s showing page fault
> processing to be mmap_sem bound and the serialization via
> pthread_mutex() sets up a 'train' of threads in one case, while the
> parallel startup would run into the mmap_sem in the regular case.
>
> So I'd expect this to be a rather sensitive workload and you'd have to
> actively engineer it to hit the effect PeterZ mentioned. I could
> imagine MPI workloads to run into such patterns - but not
> deterministically.
>
> Only once you've convinced yourself that you are hitting that kind of
> effect reliably on the vanilla kernel, could/should the effects of an
> improved rwsem implementation be measured.
>
> Thanks,
>
> 	Ingo
Re: [PATCH v8 0/9] rwsem performance optimizations
On Wed, Oct 09, 2013 at 08:15:51AM +0200, Ingo Molnar wrote:
> So I'd expect this to be a rather sensitive workload and you'd have to
> actively engineer it to hit the effect PeterZ mentioned. I could imagine
> MPI workloads to run into such patterns - but not deterministically.

The workload that I got the report from was a virus scanner, it would
spawn nr_cpus threads and {mmap file, scan content, munmap} through your
filesystem.

Now if I only could remember who reported this.. :/
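Per thread, that workload is essentially the following; a sketch, with
an assumed file path and a trivial checksum standing in for the real
content scan:

/*
 * Sketch of the described scanner pattern: each of nr_cpus threads
 * repeatedly mmaps a file, reads through the contents, and munmaps.
 * The "scan" here is just a byte sum; the path is the caller's.
 */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static unsigned long scan_one(const char *path)
{
        struct stat st;
        unsigned long sum = 0;
        int fd = open(path, O_RDONLY);
        char *p;
        off_t i;

        if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
                goto out;
        p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
                goto out;
        for (i = 0; i < st.st_size; i++)        /* "scan content" */
                sum += p[i];
        munmap(p, st.st_size);
out:
        if (fd >= 0)
                close(fd);
        return sum;
}

Each file touched this way takes mmap_sem for write twice (mmap and
munmap) and for read on every page fault during the scan, which is why
nr_cpus such threads pile up on the lock.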
Re: [PATCH v8 0/9] rwsem performance optimizations
* Tim Chen wrote:

> Ingo,
>
> I ran the vanilla kernel, the kernel with all rwsem patches and the
> kernel with all patches except the optimistic spin one. I am listing
> two presentations of the data. Please note that there is about 5%
> run-run variation.
>
> % change in performance vs vanilla kernel
> #threads   all       without optspin
> mmap only
> 1            1.9%       1.6%
> 5           43.8%       2.6%
> 10          22.7%      -3.0%
> 20         -12.0%      -4.5%
> 40         -26.9%      -2.0%
> mmap with mutex acquisition
> 1           -2.1%      -3.0%
> 5           -1.9%       1.0%
> 10           4.2%      12.5%
> 20          -4.1%       0.6%
> 40          -2.8%      -1.9%

Silly question: how do the two methods of starting N threads compare to
each other? Do they have identical runtimes? I think PeterZ's point was
that the pthread_mutex case, despite adding extra serialization,
actually runs faster in some circumstances.

Also, mind posting the testcase? What 'work' do the threads do - clear
some memory area? How big is the memory area?

I'd expect this to be about large enough mmap()s showing page fault
processing to be mmap_sem bound and the serialization via
pthread_mutex() sets up a 'train' of threads in one case, while the
parallel startup would run into the mmap_sem in the regular case.

So I'd expect this to be a rather sensitive workload and you'd have to
actively engineer it to hit the effect PeterZ mentioned. I could imagine
MPI workloads to run into such patterns - but not deterministically.

Only once you've convinced yourself that you are hitting that kind of
effect reliably on the vanilla kernel, could/should the effects of an
improved rwsem implementation be measured.

Thanks,

	Ingo
Re: [PATCH v8 0/9] rwsem performance optimizations
* Tim Chen tim.c.c...@linux.intel.com wrote: Ingo, I ran the vanilla kernel, the kernel with all rwsem patches and the kernel with all patches except the optimistic spin one. I am listing two presentations of the data. Please note that there is about 5% run-run variation. % change in performance vs vanilla kernel #threads all without optspin mmap only 1 1.9%1.6% 5 43.8% 2.6% 1022.7% -3.0% 20-12.0% -4.5% 40-26.9% -2.0% mmap with mutex acquisition 1 -2.1% -3.0% 5 -1.9% 1.0% 104.2%12.5% 20-4.1% 0.6% 40-2.8% -1.9% Silly question: how do the two methods of starting N threads compare to each other? Do they have identical runtimes? I think PeterZ's point was that the pthread_mutex case, despite adding extra serialization, actually runs faster in some circumstances. Also, mind posting the testcase? What 'work' do the threads do - clear some memory area? How big is the memory area? I'd expect this to be about large enough mmap()s showing page fault processing to be mmap_sem bound and the serialization via pthread_mutex() sets up a 'train' of threads in one case, while the parallel startup would run into the mmap_sem in the regular case. So I'd expect this to be a rather sensitive workload and you'd have to actively engineer it to hit the effect PeterZ mentioned. I could imagine MPI workloads to run into such patterns - but not deterministically. Only once you've convinced yourself that you are hitting that kind of effect reliably on the vanilla kernel, could/should the effects of an improved rwsem implementation be measured. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 0/9] rwsem performance optimizations
On Wed, Oct 09, 2013 at 08:15:51AM +0200, Ingo Molnar wrote: So I'd expect this to be a rather sensitive workload and you'd have to actively engineer it to hit the effect PeterZ mentioned. I could imagine MPI workloads to run into such patterns - but not deterministically. The workload that I got the report from was a virus scanner, it would spawn nr_cpus threads and {mmap file, scan content, munmap} through your filesystem. Now if I only could remember who reported this.. :/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 0/9] rwsem performance optimizations
On Wed, 2013-10-09 at 08:15 +0200, Ingo Molnar wrote:
> * Tim Chen wrote:
> > Ingo,
> >
> > I ran the vanilla kernel, the kernel with all rwsem patches and the
> > kernel with all patches except the optimistic spin one. I am listing
> > two presentations of the data. Please note that there is about 5%
> > run-run variation.
> >
> > % change in performance vs vanilla kernel
> > #threads	all	without optspin
> > mmap only
> > 1		1.9%	1.6%
> > 5		43.8%	2.6%
> > 10		22.7%	-3.0%
> > 20		-12.0%	-4.5%
> > 40		-26.9%	-2.0%
> > mmap with mutex acquisition
> > 1		-2.1%	-3.0%
> > 5		-1.9%	1.0%
> > 10		4.2%	12.5%
> > 20		-4.1%	0.6%
> > 40		-2.8%	-1.9%
>
> Silly question: how do the two methods of starting N threads compare to
> each other?

They both started N pthreads and run for a fixed time. The throughput of
the mmap with mutex vs the pure mmap case is below:

% change in performance of the mmap with pthread-mutex vs pure mmap
#threads	vanilla	all rwsem patches	without optspin
1		3.0%	-1.0%			-1.7%
5		7.2%	-26.8%			5.5%
10		5.2%	-10.6%			22.1%
20		6.8%	16.4%			12.5%
40		-0.2%	32.7%			0.0%

So with the mutex, the vanilla kernel and the one without optspin both
run faster. This is consistent with what Peter reported. With optspin,
the picture is more mixed, with lower throughput at low to moderate
numbers of threads and higher throughput with a high number of threads.

> Do they have identical runtimes?

Yes, they both have identical runtimes. I look at the number of mmap and
munmap operations I could push through.

> I think PeterZ's point was that the pthread_mutex case, despite adding
> extra serialization, actually runs faster in some circumstances.

Yes, I also see the pthread mutex run faster for the vanilla kernel from
the data above.

> Also, mind posting the testcase? What 'work' do the threads do - clear
> some memory area?

The test case does a simple mmap and munmap of 1MB of memory per
iteration.

> How big is the memory area?

1MB.

The two cases are created as:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB";

void testcase(unsigned long long *iterations)
{
	while (1) {
		char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
			       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		assert(c != MAP_FAILED);
		munmap(c, MEMSIZE);

		(*iterations)++;
	}
}

and adding a mutex to serialize:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB with mutex";

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void testcase(unsigned long long *iterations)
{
	while (1) {
		pthread_mutex_lock(&mutex);
		char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
			       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		assert(c != MAP_FAILED);
		munmap(c, MEMSIZE);
		pthread_mutex_unlock(&mutex);

		(*iterations)++;
	}
}

and each is run as a pthread (see the driver sketch after this message).

> I'd expect this to be about large enough mmap()s showing page fault
> processing to be mmap_sem bound and the serialization via
> pthread_mutex() sets up a 'train' of threads in one case, while the
> parallel startup would run into the mmap_sem in the regular case.
>
> So I'd expect this to be a rather sensitive workload and you'd have to
> actively engineer it to hit the effect PeterZ mentioned. I could
> imagine MPI workloads to run into such patterns - but not
> deterministically.
>
> Only once you've convinced yourself that you are hitting that kind of
> effect reliably on the vanilla kernel, could/should the effects of an
> improved rwsem implementation be measured.
> Thanks,
>
> 	Ingo
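[For reference, a minimal driver of the kind that could run the testcase() functions above is sketched below. This is an assumed harness, not Tim's actual one: it spawns N threads, sleeps for a fixed runtime, then sums the per-thread iteration counts; RUNTIME and worker() are inventions for this example.]

	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define RUNTIME 10	/* seconds */

	extern void testcase(unsigned long long *iterations);	/* from above */

	static void *worker(void *arg)
	{
		testcase(arg);	/* loops forever, bumping this thread's counter */
		return NULL;
	}

	int main(int argc, char **argv)
	{
		int nthreads = argc > 1 ? atoi(argv[1]) : 1;
		unsigned long long counts[nthreads];
		pthread_t tid[nthreads];

		for (int i = 0; i < nthreads; i++) {
			counts[i] = 0;
			pthread_create(&tid[i], NULL, worker, &counts[i]);
		}

		sleep(RUNTIME);

		/*
		 * The threads are still running; these reads are racy but
		 * good enough for a throughput estimate.
		 */
		unsigned long long total = 0;
		for (int i = 0; i < nthreads; i++)
			total += counts[i];

		printf("%d threads: %llu mmap/munmap iterations in %d seconds\n",
		       nthreads, total, RUNTIME);
		return 0;
	}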
Re: [PATCH v8 0/9] rwsem performance optimizations
On Wed, Oct 9, 2013 at 12:28 AM, Peter Zijlstra wrote:
>
> The workload that I got the report from was a virus scanner, it would
> spawn nr_cpus threads and {mmap file, scan content, munmap} through
> your filesystem.

So I suspect we could make the mmap_sem write area *much* smaller for
the normal cases.

Look at do_mmap_pgoff(), for example: it is run entirely under mmap_sem,
but 99% of what it does doesn't actually need the lock.

The part that really needs the lock is

	addr = get_unmapped_area(file, addr, len, pgoff, flags);
	addr = mmap_region(file, addr, len, vm_flags, pgoff);

but we hold it over all the other stuff too.

In fact, even if we moved the mmap_sem down into do_mmap(), and moved
code around a bit to only hold it over those functions, it would still
cover unnecessarily much. For example, while merging is common, not
merging is pretty common too, and we do that

	vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);

allocation under the lock. We could easily do things like preallocate it
outside the lock.

Right now mmap_sem covers pretty much the whole system call (we do do
some security checks outside of it). I think the main issue is that
nobody has ever cared deeply enough to see how far this could be pushed.
I suspect there is some low-hanging fruit for anybody who is willing to
handle the pain..

	Linus
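[A rough sketch of what that preallocation could look like, in pseudo-kernel style. This is illustrative only: mmap_prealloc_sketch, the elided merge logic and the 'used' flag are inventions for this example, not actual mm/mmap.c code.]

	#include <linux/mm.h>
	#include <linux/sched.h>
	#include <linux/slab.h>

	static unsigned long mmap_prealloc_sketch(struct file *file,
			unsigned long addr, unsigned long len,
			unsigned long pgoff, unsigned long flags)
	{
		struct mm_struct *mm = current->mm;
		struct vm_area_struct *vma;
		bool used = false;

		/* Allocate before taking mmap_sem; this may prove unnecessary. */
		vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
		if (!vma)
			return -ENOMEM;

		down_write(&mm->mmap_sem);
		addr = get_unmapped_area(file, addr, len, pgoff, flags);
		/*
		 * Try to merge with an adjacent vma here; only when merging
		 * fails, link the preallocated vma into the tree (set 'used').
		 */
		up_write(&mm->mmap_sem);

		if (!used)	/* merged: the preallocation was wasted */
			kmem_cache_free(vm_area_cachep, vma);

		return addr;
	}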
Re: [PATCH v8 0/9] rwsem performance optimizations
On Wed, 2013-10-09 at 20:14 -0700, Linus Torvalds wrote:
> On Wed, Oct 9, 2013 at 12:28 AM, Peter Zijlstra wrote:
> >
> > The workload that I got the report from was a virus scanner, it
> > would spawn nr_cpus threads and {mmap file, scan content, munmap}
> > through your filesystem.
>
> So I suspect we could make the mmap_sem write area *much* smaller for
> the normal cases.
>
> Look at do_mmap_pgoff(), for example: it is run entirely under
> mmap_sem, but 99% of what it does doesn't actually need the lock.
>
> The part that really needs the lock is
>
> 	addr = get_unmapped_area(file, addr, len, pgoff, flags);
> 	addr = mmap_region(file, addr, len, vm_flags, pgoff);
>
> but we hold it over all the other stuff too.

True. By looking at the callers, we're always doing:

	down_write(&mm->mmap_sem);
	do_mmap_pgoff();
	...
	up_write(&mm->mmap_sem);

That goes for shm, aio, and of course mmap_pgoff(). While I know you
hate two-level locking, one way to go about this is to take the lock
inside do_mmap_pgoff() after the initial checks (flags, page align,
etc.) and return with the lock held, leaving the caller to unlock it.

> In fact, even if we moved the mmap_sem down into do_mmap(), and moved
> code around a bit to only hold it over those functions, it would still
> cover unnecessarily much. For example, while merging is common, not
> merging is pretty common too, and we do that
>
> 	vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
>
> allocation under the lock. We could easily do things like preallocate
> it outside the lock.

AFAICT there are also checks that should be done at the beginning of the
function, such as checking for the MAP_LOCKED and VM_LOCKED flags before
calling get_unmapped_area().

Thanks,
Davidlohr
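[A sketch of the calling convention being suggested. This is illustrative only, not real kernel code: the _locked suffix and the particular checks shown are assumptions for this example.]

	#include <linux/mm.h>
	#include <linux/sched.h>

	/*
	 * Validate arguments without the lock, then take mmap_sem and
	 * return with it held; the caller drops it when done.
	 */
	static unsigned long do_mmap_pgoff_locked(struct file *file,
			unsigned long addr, unsigned long len,
			unsigned long prot, unsigned long flags,
			unsigned long pgoff)
	{
		struct mm_struct *mm = current->mm;

		/* Cheap checks needing no lock: flags, page alignment, limits. */
		if (!len || (addr & ~PAGE_MASK))
			return -EINVAL;

		down_write(&mm->mmap_sem);
		/* get_unmapped_area() and mmap_region() would go here. */
		return addr;	/* mmap_sem is intentionally still held */
	}

	/*
	 * A caller (shm, aio, mmap_pgoff) would then look like:
	 *
	 *	addr = do_mmap_pgoff_locked(file, addr, len, prot, flags, pgoff);
	 *	... any further work under the lock ...
	 *	up_write(&current->mm->mmap_sem);
	 */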
Re: [PATCH v8 0/9] rwsem performance optimizations
On Thu, 2013-10-03 at 09:32 +0200, Ingo Molnar wrote:
> * Tim Chen wrote:
>
> > For version 8 of the patchset, we included the patch from Waiman to
> > streamline wakeup operations and also optimize the MCS lock used in
> > rwsem and mutex.
>
> I'd be feeling a lot easier about this patch series if you also had
> performance figures that show how mmap_sem is affected.
>
> These:
>
> > Tim got the following improvement for exim mail server
> > workload on 40 core system:
> >
> > Alex+Tim's patchset:        +4.8%
> > Alex+Tim+Waiman's patchset: +5.3%
>
> appear to be mostly related to the anon_vma->rwsem. But once that lock
> is changed to an rwlock_t, this measurement falls away.
>
> Peter Zijlstra suggested the following testcase:
>
> ===>
> In fact, try something like this from userspace:
>
> n-threads:
>
> 	pthread_mutex_lock(&mutex);
> 	foo = mmap();
> 	pthread_mutex_unlock(&mutex);
>
> 	/* work */
>
> 	pthread_mutex_lock(&mutex);
> 	munmap(foo);
> 	pthread_mutex_unlock(&mutex);
>
> vs
>
> n-threads:
>
> 	foo = mmap();
> 	/* work */
> 	munmap(foo);

Ingo,

I ran the vanilla kernel, the kernel with all rwsem patches and the
kernel with all patches except the optimistic spin one. I am listing two
presentations of the data. Please note that there is about 5% run-run
variation.

% change in performance vs vanilla kernel
#threads	all	without optspin
mmap only
1		1.9%	1.6%
5		43.8%	2.6%
10		22.7%	-3.0%
20		-12.0%	-4.5%
40		-26.9%	-2.0%
mmap with mutex acquisition
1		-2.1%	-3.0%
5		-1.9%	1.0%
10		4.2%	12.5%
20		-4.1%	0.6%
40		-2.8%	-1.9%

The optimistic spin case does very well at low to moderate contention,
but worse under very heavy contention for the pure mmap case. For the
case with pthread mutex, there's not much change from the vanilla
kernel.

% change in performance of the mmap with pthread-mutex vs pure mmap
#threads	vanilla	all	without optspin
1		3.0%	-1.0%	-1.7%
5		7.2%	-26.8%	5.5%
10		5.2%	-10.6%	22.1%
20		6.8%	16.4%	12.5%
40		-0.2%	32.7%	0.0%

In general, the vanilla and no-optspin cases perform better with
pthread-mutex. For the case with optspin, mmap with pthread-mutex is
worse at low to moderate contention and better at high contention.

Tim

> I've had reports that the former was significantly faster than the
> latter.
> <===
>
> this could be put into a standalone testcase, or you could add it as a
> new subcommand of 'perf bench', which already has some pthread code,
> see for example tools/perf/bench/sched-messaging.c. Adding:
>
> 	perf bench mm threads
>
> or so would be a natural thing to have.
>
> Thanks,
>
> 	Ingo
Re: [PATCH v8 0/9] rwsem performance optimizations
* Tim Chen wrote:

> For version 8 of the patchset, we included the patch from Waiman to
> streamline wakeup operations and also optimize the MCS lock used in
> rwsem and mutex.

I'd be feeling a lot easier about this patch series if you also had
performance figures that show how mmap_sem is affected.

These:

> Tim got the following improvement for exim mail server
> workload on 40 core system:
>
> Alex+Tim's patchset:        +4.8%
> Alex+Tim+Waiman's patchset: +5.3%

appear to be mostly related to the anon_vma->rwsem. But once that lock
is changed to an rwlock_t, this measurement falls away.

Peter Zijlstra suggested the following testcase:

===>
In fact, try something like this from userspace:

n-threads:

	pthread_mutex_lock(&mutex);
	foo = mmap();
	pthread_mutex_unlock(&mutex);

	/* work */

	pthread_mutex_lock(&mutex);
	munmap(foo);
	pthread_mutex_unlock(&mutex);

vs

n-threads:

	foo = mmap();
	/* work */
	munmap(foo);

I've had reports that the former was significantly faster than the
latter.
<===

this could be put into a standalone testcase, or you could add it as a
new subcommand of 'perf bench', which already has some pthread code, see
for example tools/perf/bench/sched-messaging.c. Adding:

	perf bench mm threads

or so would be a natural thing to have.

Thanks,

	Ingo
[PATCH v8 0/9] rwsem performance optimizations
For version 8 of the patchset, we included the patch from Waiman to
streamline wakeup operations and also optimize the MCS lock used in
rwsem and mutex.

In this patchset, we introduce three categories of optimizations to the
read-write semaphore. The first four patches from Alex Shi reduce cache
bouncing of the sem->count field by doing a pre-read of sem->count and
avoiding cmpxchg if possible.

The next four patches from Tim, Davidlohr and Jason introduce optimistic
spinning logic similar to that in the mutex code for the writer lock
acquisition of rwsem. This addresses the general 'mutexes outperform
writer-rwsems' situation that has been seen in more than one case. Users
now need not worry about performance issues when choosing between these
two locking mechanisms. We have also factored out the MCS lock
originally in the mutex code into its own file, performed micro
optimizations and corrected the memory barriers so it can be used for
general lock/unlock of critical sections.

The last patch from Waiman helps to streamline the wakeup operation by
avoiding multiple threads all doing wakeup operations when only one
wakeup thread is enough. This significantly reduces lock contention from
multiple wakeup threads.

Tim got the following improvement for exim mail server workload on a 40
core system:

Alex+Tim's patchset:        +4.8%
Alex+Tim+Waiman's patchset: +5.3%

Without these optimizations, Davidlohr Bueso saw a -8% regression to
aim7's shared and high_systime workloads when he switched i_mmap_mutex
to rwsem. Tests were on an 8 socket, 80 core system. With Alex and Tim's
patches, he got significant improvements to the aim7 suite instead of
regressions: alltests (+16.3%), custom (+20%), disk (+19.5%),
high_systime (+7%), shared (+18.4%) and short (+6.3%). More Aim7 numbers
will be posted when Davidlohr has a chance to test the complete
patchset, including Waiman's patch.

Thanks to Ingo Molnar, Peter Hurley, Peter Zijlstra and Paul McKenney
for helping to review this patchset.

Tim

Changelog:

v8:
1. Added Waiman's patch to avoid multiple wakeup thread lock contention.
2. Micro-optimizations of MCS lock.
3. Corrected the barriers of MCS lock to prevent critical sections from
   leaking.

v7:
1. Rename mcslock.h to mcs_spinlock.h and also rename MCS-related fields
   with an mcs prefix.
2. Properly define the type of the *mcs_lock field instead of leaving it
   as *void.
3. Added a brief explanation of the MCS lock.

v6:
1. Fix missing mcslock.h file.
2. Fix various code style issues.

v5:
1. Try optimistic spinning before we put the writer on the wait queue to
   avoid bottlenecking at the wait queue. This provides a 5% boost to
   the exim workload and between a 2% to 8% boost to aim7.
2. Put the MCS locking code into its own mcslock.h file for better reuse
   between mutex.c and rwsem.c.
3. Remove the configuration RWSEM_SPIN_ON_WRITE_OWNER and make the
   operations default per Ingo's suggestions.

v4:
1. Fixed a bug in the task_struct definition in rwsem_can_spin_on_owner.
2. Fix another typo for the RWSEM_SPIN_ON_WRITE_OWNER config option.

v3:
1. Added ACCESS_ONCE to the sem->count access in
   rwsem_can_spin_on_owner.
2. Fix typo bug for the RWSEM_SPIN_ON_WRITE_OWNER option in init/Kconfig.

v2:
1. Reorganize changes to down_write_trylock and do_wake into 4 patches
   and fixed a bug referencing &sem->count when sem->count is intended.
2. Fix unsafe sem->owner de-reference in rwsem_can_spin_on_owner; kept
   the option on for more seasoning, but it can be turned off should it
   be detrimental.
3. Various patch comment updates.

Alex Shi (4):
  rwsem: check the lock before cmpxchg in down_write_trylock
  rwsem: remove 'out' label in do_wake
  rwsem: remove try_reader_grant label in do_wake
  rwsem/wake: check lock before doing atomic update

Jason Low (2):
  MCS Lock: optimizations and extra comments
  MCS Lock: Barrier corrections

Tim Chen (2):
  MCS Lock: Restructure the MCS lock defines and locking code into its
    own file
  rwsem: do optimistic spinning for writer lock acquisition

Waiman Long (1):
  rwsem: reduce spinlock contention in wakeup code path

 include/asm-generic/rwsem.h  |   8 +-
 include/linux/mcs_spinlock.h |  82 ++
 include/linux/mutex.h        |   5 +-
 include/linux/rwsem.h        |   9 ++-
 kernel/mutex.c               |  60 +-
 kernel/rwsem.c               |  19 +++-
 lib/rwsem.c                  | 255 +-
 7 files changed, 349 insertions(+), 89 deletions(-)
 create mode 100644 include/linux/mcs_spinlock.h

--
1.7.4.4
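[For readers new to MCS locks, a simplified sketch of the queueing technique the series factors out follows. This illustrates the general algorithm only, assuming kernel-style primitives (xchg, cmpxchg, ACCESS_ONCE, cpu_relax); it is not the series' actual include/linux/mcs_spinlock.h, whose naming and barrier placement differ. Each CPU spins on its own queue node's cacheline instead of on a shared lock word, which avoids the cacheline bouncing a contended test-and-set spinlock causes.]

	struct mcs_spinlock {
		struct mcs_spinlock *next;
		int locked;	/* 1 once the lock is granted to this node */
	};

	static inline void mcs_spin_lock(struct mcs_spinlock **lock,
					 struct mcs_spinlock *node)
	{
		struct mcs_spinlock *prev;

		node->locked = 0;
		node->next = NULL;

		/* Queue ourselves at the tail; xchg() implies a full barrier. */
		prev = xchg(lock, node);
		if (!prev)
			return;	/* lock was free, we now own it */

		ACCESS_ONCE(prev->next) = node;
		/* Spin locally on our own node until our predecessor hands off. */
		while (!ACCESS_ONCE(node->locked))
			cpu_relax();
		smp_rmb();	/* order the critical section after the handoff */
	}

	static inline void mcs_spin_unlock(struct mcs_spinlock **lock,
					   struct mcs_spinlock *node)
	{
		struct mcs_spinlock *next = ACCESS_ONCE(node->next);

		if (!next) {
			/* No known successor: try to release the lock outright. */
			if (cmpxchg(lock, node, NULL) == node)
				return;
			/* A new waiter raced in; wait for it to link itself. */
			while (!(next = ACCESS_ONCE(node->next)))
				cpu_relax();
		}
		smp_wmb();	/* publish the critical section before the handoff */
		ACCESS_ONCE(next->locked) = 1;
	}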