Re: SSE in libthr
Below is an updated patch to incorporate everyone's feedback so far.

I recognize all of the counter-arguments, and I agree with them in general. Indeed, as applications use more SIMD, this kind of patch goes in the wrong direction. However, there are applications that do not use enough SSE to offset the extra context-switch cost. SSE does not provide a clear benefit in the current libthr code with the current compiler, but it does provide a clear loss in some cases. Therefore, disabling SSE in libthr is a non-loss for most, and a gain for some.

I refrained from disabling SSE in libc--as was suggested--because I can't make the above argument for libc. It provides such a variety of code that SSE might be a net win in some cases. I wish I had time to identify and benchmark the interesting cases.

Thanks in advance for your further review and comments.

Eric

Index: head/lib/libthr/arch/amd64/Makefile.inc
===
--- head/lib/libthr/arch/amd64/Makefile.inc (revision 281473)
+++ head/lib/libthr/arch/amd64/Makefile.inc (working copy)
@@ -1,3 +1,9 @@
 #$FreeBSD$
 
 SRCS+= _umtx_op_err.S
+
+# With the current compiler and libthr code, using SSE in libthr
+# does not provide enough performance improvement to outweigh
+# the extra context switch cost. This can measurably impact
+# performance when the application also does not use enough SSE.
+CFLAGS+=${CFLAGS_NO_SIMD}
Index: head/lib/libthr/arch/i386/Makefile.inc
===
--- head/lib/libthr/arch/i386/Makefile.inc (revision 281473)
+++ head/lib/libthr/arch/i386/Makefile.inc (working copy)
@@ -1,3 +1,9 @@
 # $FreeBSD$
 
 SRCS+= _umtx_op_err.S
+
+# With the current compiler and libthr code, using SSE in libthr
+# does not provide enough performance improvement to outweigh
+# the extra context switch cost. This can measurably impact
+# performance when the application also does not use enough SSE.
+CFLAGS+=${CFLAGS_NO_SIMD}
Index: head/libexec/rtld-elf/amd64/Makefile.inc
===
--- head/libexec/rtld-elf/amd64/Makefile.inc (revision 281473)
+++ head/libexec/rtld-elf/amd64/Makefile.inc (working copy)
@@ -1,6 +1,6 @@
 # $FreeBSD$
 
-CFLAGS+= -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3 -msoft-float
+CFLAGS+= ${CFLAGS_NO_SIMD} -msoft-float
 # Uncomment this to build the dynamic linker as an executable instead
 # of a shared library:
 #LDSCRIPT= ${.CURDIR}/${MACHINE_CPUARCH}/elf_rtld.x
Index: head/libexec/rtld-elf/i386/Makefile.inc
===
--- head/libexec/rtld-elf/i386/Makefile.inc (revision 281473)
+++ head/libexec/rtld-elf/i386/Makefile.inc (working copy)
@@ -1,6 +1,6 @@
 # $FreeBSD$
 
-CFLAGS+= -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3 -msoft-float
+CFLAGS+= ${CFLAGS_NO_SIMD} -msoft-float
 # Uncomment this to build the dynamic linker as an executable instead
 # of a shared library:
 #LDSCRIPT= ${.CURDIR}/${MACHINE_CPUARCH}/elf_rtld.x
Index: head/share/mk/bsd.sys.mk
===
--- head/share/mk/bsd.sys.mk (revision 281473)
+++ head/share/mk/bsd.sys.mk (working copy)
@@ -153,6 +153,26 @@ SSP_CFLAGS?= -fstack-protector
 CFLAGS+= ${SSP_CFLAGS}
 .endif # SSP !ARM !MIPS
 
+#
+# Prohibit the compiler from emitting SIMD instructions.
+# These flags are added to CFLAGS in areas where the extra context-switch
+# cost outweighs the advantages of SIMD instructions.
+#
+# gcc:
+# Setting -mno-mmx implies -mno-3dnow
+# Setting -mno-sse implies -mno-sse2, -mno-sse3, -mno-ssse3 and -mfpmath=387
+#
+# clang:
+# Setting -mno-mmx implies -mno-3dnow and -mno-3dnowa
+# Setting -mno-sse implies -mno-sse2, -mno-sse3, -mno-ssse3, -mno-sse41 and
+# -mno-sse42
+# (-mfpmath= is not supported)
+#
+.if ${MACHINE_CPUARCH} == i386 || ${MACHINE_CPUARCH} == amd64
+CFLAGS_NO_SIMD.clang= -mno-avx
+CFLAGS_NO_SIMD=-mno-mmx -mno-sse ${CFLAGS_NO_SIMD.${COMPILER_TYPE}}
+.endif
+
 # Allow user-specified additional warning flags, plus compiler specific flag overrides.
 # Unless we've overriden this...
 .if ${MK_WARNS} != no
Index: head/sys/conf/kern.mk
===
--- head/sys/conf/kern.mk (revision 281473)
+++ head/sys/conf/kern.mk (working copy)
@@ -75,18 +75,10 @@ FORMAT_EXTENSIONS= -fformat-extensions
 # operations inside the kernel itself. These operations are exclusively
 # reserved for user applications.
 #
-# gcc:
-# Setting -mno-mmx implies -mno-3dnow
-# Setting -mno-sse implies -mno-sse2, -mno-sse3 and -mno-ssse3
-#
-# clang:
-# Setting -mno-mmx implies -mno-3dnow and -mno-3dnowa
-# Setting -mno-sse implies -mno-sse2, -mno-sse3, -mno-ssse3, -mno-sse41 and -mno-sse42
-#
 .if ${MACHINE_CPUARCH} == i386
 CFLAGS.gcc+= -mno-align-long
Re: SSE in libthr
On Saturday, March 28, 2015 10:41:48 AM Adrian Chadd wrote:
> Ok, so how do we reduce the amount of FPU save and restores, or make
> them cheaper?

Or make them more useful. If you are using SSE/AVX more often between context switches in ways that are beneficial then that might offset the cost of the save and restore and result in a net win.

I have variants of strlen, memcpy, and memset that use SSE. However, microbenchmarks aren't super useful as you have noted. If you would like to try these out in some real workloads I can provide a patch to libc.

--
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: SSE in libthr
On 3/28/15 5:44 AM, Konstantin Belousov wrote:
> On Fri, Mar 27, 2015 at 01:49:03PM -0700, Rui Paulo wrote:
> > On Mar 27, 2015, at 12:26, Eric van Gyzen vangy...@freebsd.org wrote:
> > > In a nutshell: Clang emits SSE instructions on amd64 in the common
> > > path of pthread_mutex_unlock. This reduces performance by a
> > > non-trivial amount. I'd like to disable SSE in libthr.
> >
> > Good catch!
> >
> > Regarding your patch, I think we should disable even more, if
> > possible. How about:
> >
> > CFLAGS+=-mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3
>
> I think so. Also, this should be done for libc as well, both on i386
> and amd64. I am not sure, should compiler-rt be included into the set ?

The point is that clang will do this anywhere it can, because it isn't taking into account the side effects, just the speed of the commands themselves.
Re: SSE in libthr
On 28 Mar 2015, at 13:54, Julian Elischer jul...@freebsd.org wrote:
> the point is that clang will do this anywhere it can, because it isn't
> taking into account the side effects, just the speed of the commands
> themselves.

This is also something that is not going to decrease. Clang now enables the SLP vectoriser by default and this code is constantly being improved. Current generation vector units are explicitly designed as targets for compiler autovectorisation, not for hand-tuned DSP code (which, increasingly, runs on the GPU anyway). This means that we're increasingly going to see SSE/AVX/NEON usage in CPU-bound code, even without an explicit programmer decision to do so.

Optimising for the case when the vector unit is not used is about as sensible as optimising for the single-core case: it will affect some people, but generally not those who care about performance, and a decreasing number of people over time.

David
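To make the pattern under discussion concrete, here is a minimal, self-contained sketch of the kind of code the vectoriser can merge: two adjacent pointer-sized stores of zero, as in the MUTEX_INIT_LINK macro from the original post. The names struct mutex_like and init_link are hypothetical and exist only for illustration; the real libthr mutex structure has many more fields. Whether a given compiler emits two movq instructions or a single 16-byte SSE store depends on its version and optimisation level, so treat this as something to inspect rather than a guaranteed result.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the queue linkage inside a libthr mutex.
 * The inner struct mirrors the two-pointer layout of a TAILQ_ENTRY:
 * tqe_next at offset 0 and tqe_prev at offset 8 on amd64.
 */
struct mutex_like {
	struct {
		struct mutex_like *tqe_next;
		struct mutex_like **tqe_prev;
	} m_qe;
};

/* Same shape as the macro quoted from libthr/thread/thr_mutex.c. */
#define MUTEX_INIT_LINK(m) do {			\
	(m)->m_qe.tqe_prev = NULL;		\
	(m)->m_qe.tqe_next = NULL;		\
} while (0)

/*
 * Two adjacent 8-byte stores of zero. An optimising compiler is free to
 * combine them into one 16-byte vector store unless SSE is disabled.
 */
void
init_link(struct mutex_like *m)
{
	MUTEX_INIT_LINK(m);
}
```

One way to compare the generated code is to compile the file twice to assembly, for example with `cc -O2 -S` and again with `cc -O2 -mno-sse -mno-mmx -S`, and look at the body of init_link in each output.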
Re: SSE in libthr
Eric van Gyzen wrote this message on Fri, Mar 27, 2015 at 17:43 -0400:
> On 03/27/2015 16:49, Rui Paulo wrote:
> > Regarding your patch, I think we should disable even more, if
> > possible. How about:
> >
> > CFLAGS+=-mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3
>
> Yes, I was considering copying all of the similar flags that we use in
> the kernel. That seems wise. According to comments in
> sys/conf/kern.mk, only no-mmx and no-sse would be necessary, as they
> imply the others.
>
> dim@ raised the possibility of CPUTYPE=foo on i386, so I would also
> apply this change to i386.
>
> An updated patch is below.

We should probably add a $(CFLAGS_NOFPU) define and use that. Then it can be properly tweaked per compiler and per arch as necessary instead of hardcoding the selection in each makefile...

--
John-Mark Gurney Voice: +1 415 225 5579

All that I will do, has been done, All that I have, has not.
Re: SSE in libthr
Ok, so how do we reduce the amount of FPU save and restores, or make them cheaper?

-a
Re: SSE in libthr
If SIMD instructions are used for string processing, and FPU (AVX) contexts are NOT saved/restored properly on process (thread) switching, the string being processed could be corrupted by another process (thread). Can't that be a security risk? (Broken string parameters for syscalls, etc.)

If so, FPU (AVX) contexts should be saved/restored at least on process (thread) switching.

* If SIMD instructions are NOT used in the kernel and kernel modules at all, there would be no need for saving/restoring FPU contexts on interrupts.

This is not limited to system libraries. As Alan noted, third party applications can use their own string processing code using SIMD.

--
青木 知明 [Tomoaki AOKI]
junch...@dec.sakura.ne.jp
mxe02...@nifty.com
Re: SSE in libthr
On Fri, Mar 27, 2015 at 10:40:57PM +0100, Jilles Tjoelker wrote:
> On Fri, Mar 27, 2015 at 03:26:17PM -0400, Eric van Gyzen wrote:
> > In a nutshell: Clang emits SSE instructions on amd64 in the common
> > path of pthread_mutex_unlock. This reduces performance by a
> > non-trivial amount. I'd like to disable SSE in libthr.
>
> How about saving and restoring the FPU/SSE state eagerly instead of the
> current CR0.TS-based lazy method? There is overhead associated with #NM
> exception handling (fpudna) which is not worth it if FPU/SSE are used
> often. This would apply to userland threads only; kernel threads
> normally do not use FPU/SSE and handle the FPU/SSE state manually if
> they do.

First, we have no choice but to save the FPU context when a thread is switched out. It is not practical to try to keep the state in the hardware, since fetching it to another core is too troublesome.

Second, the biggest overhead of #NM is the reading of the FPU context from memory (or cache), not the handler itself. The save area for SSE-capable machines, i.e. all amd64, is ~400 bytes, and XSAVEOPT does not help much for reading of legacy FPU + XMM state. It does help for YMM.

That said, your proposal would force all threads to pay a higher cost at context switch time, increasing latency.

> There is performance improvement potential in using SSE for optimizing
> string functions, for example. Even a simple SSE2 strlen easily
> outperforms the already optimized lib/libc/string/strlen.c in a
> microbenchmark, and many other string functions are slow
> byte-at-a-time implementations.

If the program does a lot of work with the FPU between switches, the cost is obviously mitigated. Note that even for the worst case of the reported microbenchmark, the measured overhead is ~10-15%. So if string ops indeed take a significant share of the program time, the FPU #NM handling cost should be very low even with the current scheme.
SSE in libthr
In a nutshell:

Clang emits SSE instructions on amd64 in the common path of pthread_mutex_unlock. This reduces performance by a non-trivial amount. I'd like to disable SSE in libthr.

In more detail:

In libthr/thread/thr_mutex.c, we find the following:

#define MUTEX_INIT_LINK(m) do {		\
	(m)->m_qe.tqe_prev = NULL;	\
	(m)->m_qe.tqe_next = NULL;	\
} while (0)

In 9.1, clang 3.1 emits two ordinary mov instructions:

	movq $0x0,0x8(%rax)
	movq $0x0,(%rax)

Since 10.0 and clang 3.3, clang emits these SSE instructions:

	xorps %xmm0,%xmm0
	movups %xmm0,(%rax)

Although these look harmless enough, using the FPU can reduce performance by incurring extra overhead due to context-switching the FPU state.

As I mentioned, this code is used in the common path of pthread_mutex_unlock. I have a simple test program that creates four threads, all contending for a single mutex, and measures the total number of lock acquisitions over several seconds. When libthr is built with SSE, as is current, I get around 53 million locks in 5 seconds. Without SSE, I get around 60 million (13% more). DTrace shows around 790,000 calls to fpudna versus 10 calls. There could be other factors involved, but I presume that the FPU context switches account for most of the change in performance.

Even when I add some SSE usage in the application--incidentally, these same instructions--building libthr without SSE improves performance from 53.5 million to 55.8 million (4.3%).

In the real-world application where I first noticed this, performance improves by 3-5%.

I would appreciate your thoughts and feedback. The proposed patch is below.

Eric

Index: base/head/lib/libthr/arch/amd64/Makefile.inc
===
--- base/head/lib/libthr/arch/amd64/Makefile.inc (revision 280703)
+++ base/head/lib/libthr/arch/amd64/Makefile.inc (working copy)
@@ -1,3 +1,8 @@
 #$FreeBSD$
 
 SRCS+= _umtx_op_err.S
+
+# Using SSE incurs extra overhead per context switch,
+# which measurably impacts performance when the application
+# does not otherwise use FP/SSE.
+CFLAGS+=-mno-sse
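The test program itself was not posted to the list. For readers who want to try a similar measurement, the following is a minimal sketch of what such a program might look like, assuming POSIX threads and C11 atomics. The names run_contention, worker, and MAX_THREADS are hypothetical, and the structure is a guess at the description above (several threads contending for one mutex, total acquisitions counted over a fixed interval), not the author's actual code.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define MAX_THREADS 64			/* arbitrary upper bound for this sketch */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool stop;

/* Each worker acquires and releases the shared mutex until told to stop. */
static void *
worker(void *arg)
{
	uint64_t *count = arg;		/* this thread's private counter */

	while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
		pthread_mutex_lock(&lock);
		(*count)++;
		pthread_mutex_unlock(&lock);
	}
	return (NULL);
}

/*
 * Run nthreads workers for roughly millis milliseconds and return the
 * total number of lock acquisitions. Returns 0 on invalid arguments.
 */
uint64_t
run_contention(int nthreads, unsigned millis)
{
	pthread_t tid[MAX_THREADS];
	uint64_t counts[MAX_THREADS] = { 0 };
	struct timespec ts;
	uint64_t total = 0;
	int i, started = 0;

	if (nthreads < 1 || nthreads > MAX_THREADS)
		return (0);

	ts.tv_sec = millis / 1000;
	ts.tv_nsec = (long)(millis % 1000) * 1000000L;

	atomic_store(&stop, false);
	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&tid[i], NULL, worker, &counts[i]) != 0)
			break;
		started++;
	}

	nanosleep(&ts, NULL);		/* let the workers contend */
	atomic_store(&stop, true);

	/* Counters are read only after join, so no extra locking is needed. */
	for (i = 0; i < started; i++) {
		pthread_join(tid[i], NULL);
		total += counts[i];
	}
	return (total);
}
```

Calling run_contention(4, 5000) from a small main() and printing the result would roughly mirror the four-thread, five-second run described in the post. Absolute numbers will vary with hardware, kernel, and scheduler, so only the comparison between two libthr builds on the same machine is meaningful.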
Re: SSE in libthr
Wow. I remember seeing this in the work application - all packet pushing in userland, but there are locks being acquired. I was wondering what exactly was triggering the FPU save/restore code. Now I know.

Yes, if there are no other objections, I'd love to see this in -HEAD and stable/10.

-adrian
Re: SSE in libthr
On Fri, 27 Mar 2015, Eric van Gyzen wrote:
> In a nutshell: Clang emits SSE instructions on amd64 in the common
> path of pthread_mutex_unlock. This reduces performance by a
> non-trivial amount. I'd like to disable SSE in libthr.

This makes sense to me.

--
DE
Re: SSE in libthr
On Mar 27, 2015, at 12:26, Eric van Gyzen vangy...@freebsd.org wrote:
> In a nutshell: Clang emits SSE instructions on amd64 in the common
> path of pthread_mutex_unlock. This reduces performance by a
> non-trivial amount. I'd like to disable SSE in libthr.
>
> +# Using SSE incurs extra overhead per context switch,
> +# which measurably impacts performance when the application
> +# does not otherwise use FP/SSE.
> +CFLAGS+=-mno-sse

Good catch!

Regarding your patch, I think we should disable even more, if possible. How about:

CFLAGS+=-mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3

--
Rui Paulo
Re: SSE in libthr
On Fri, Mar 27, 2015 at 03:26:17PM -0400, Eric van Gyzen wrote:
> In a nutshell: Clang emits SSE instructions on amd64 in the common
> path of pthread_mutex_unlock. This reduces performance by a
> non-trivial amount. I'd like to disable SSE in libthr.

How about saving and restoring the FPU/SSE state eagerly instead of the current CR0.TS-based lazy method? There is overhead associated with #NM exception handling (fpudna) which is not worth it if FPU/SSE are used often. This would apply to userland threads only; kernel threads normally do not use FPU/SSE and handle the FPU/SSE state manually if they do.

There is performance improvement potential in using SSE for optimizing string functions, for example. Even a simple SSE2 strlen easily outperforms the already optimized lib/libc/string/strlen.c in a microbenchmark, and many other string functions are slow byte-at-a-time implementations.

--
Jilles Tjoelker
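As an illustration of what "a simple SSE2 strlen" can look like, here is a minimal sketch using the SSE2 intrinsics from emmintrin.h. It is not the libc implementation and not the variant mentioned elsewhere in this thread; the name sse2_strlen is hypothetical. It assumes an x86 target with SSE2 and a GCC- or Clang-compatible compiler (for __builtin_ctz). It also relies on the common trick of reading whole aligned 16-byte blocks, which can touch bytes outside the string within the same block; that never crosses a page boundary, but it is outside what ISO C formally guarantees and memory checkers may flag it.

```c
#include <emmintrin.h>	/* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of an SSE2 strlen: compare 16 bytes at a time against zero
 * and use the resulting bit mask to locate the terminating NUL.
 */
size_t
sse2_strlen(const char *s)
{
	const __m128i zero = _mm_setzero_si128();
	uintptr_t addr = (uintptr_t)s;
	unsigned misalign = (unsigned)(addr & 15);
	/* Round down to the 16-byte block that contains the first byte. */
	const __m128i *p = (const __m128i *)(addr - misalign);
	unsigned mask;

	/* First block: drop mask bits for bytes that lie before s. */
	mask = (unsigned)_mm_movemask_epi8(
	    _mm_cmpeq_epi8(_mm_load_si128(p), zero));
	mask >>= misalign;
	if (mask != 0)
		return ((size_t)__builtin_ctz(mask));

	/* Remaining blocks are fully inside or after the string start. */
	for (;;) {
		p++;
		mask = (unsigned)_mm_movemask_epi8(
		    _mm_cmpeq_epi8(_mm_load_si128(p), zero));
		if (mask != 0)
			return ((size_t)((const char *)p - s) +
			    (size_t)__builtin_ctz(mask));
	}
}
```

Any such routine touches the XMM registers on every call, which is exactly the trade-off debated in this thread: it may be faster per call, but it makes the calling thread an FPU/SSE user from the kernel's point of view.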
Re: SSE in libthr
On 03/27/2015 16:49, Rui Paulo wrote:
> Regarding your patch, I think we should disable even more, if
> possible. How about:
>
> CFLAGS+=-mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3

Yes, I was considering copying all of the similar flags that we use in the kernel. That seems wise. According to comments in sys/conf/kern.mk, only no-mmx and no-sse would be necessary, as they imply the others.

dim@ raised the possibility of CPUTYPE=foo on i386, so I would also apply this change to i386.

An updated patch is below.

Eric

Index: base/head/lib/libthr/arch/amd64/Makefile.inc
===
--- base/head/lib/libthr/arch/amd64/Makefile.inc (revision 280703)
+++ base/head/lib/libthr/arch/amd64/Makefile.inc (working copy)
@@ -1,3 +1,8 @@
 #$FreeBSD$
 
 SRCS+=_umtx_op_err.S
+
+# Using SSE incurs extra overhead per context switch,
+# which measurably impacts performance when the application
+# does not otherwise use FP/SSE.
+CFLAGS+=-mno-sse -mno-mmx
Index: base/head/lib/libthr/arch/i386/Makefile.inc
===
--- base/head/lib/libthr/arch/i386/Makefile.inc (revision 280703)
+++ base/head/lib/libthr/arch/i386/Makefile.inc (working copy)
@@ -1,3 +1,8 @@
 # $FreeBSD$
 
 SRCS+=_umtx_op_err.S
+
+# Using SSE incurs extra overhead per context switch,
+# which measurably impacts performance when the application
+# does not otherwise use FP/SSE.
+CFLAGS+=-mno-sse -mno-mmx
Re: SSE in libthr
On Fri, Mar 27, 2015 at 01:49:03PM -0700, Rui Paulo wrote:
> On Mar 27, 2015, at 12:26, Eric van Gyzen vangy...@freebsd.org wrote:
> > In a nutshell: Clang emits SSE instructions on amd64 in the common
> > path of pthread_mutex_unlock. This reduces performance by a
> > non-trivial amount. I'd like to disable SSE in libthr.
> >
> > +CFLAGS+=-mno-sse
>
> Good catch!
>
> Regarding your patch, I think we should disable even more, if
> possible. How about:
>
> CFLAGS+=-mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3

I think so. Also, this should be done for libc as well, both on i386 and amd64. I am not sure: should compiler-rt be included in the set?
Re: SSE in libthr
hi,

please don't try to microoptimise crap like strlen().

The TL;DR for performant high-throughput code is: if strlen() or memcpy() is the thing that's costing you the most, you're doing it wrong.

-adrian
Re: SSE in libthr
On Fri, Mar 27, 2015 at 4:36 PM, Adrian Chadd adr...@freebsd.org wrote:
> hi,
>
> please don't try to microoptimise crap like strlen().
>
> The TL;DR for performant high-throughput code is: if strlen() or
> memcpy() is the thing that's costing you the most, you're doing it
> wrong.
>
> -adrian

I respectfully disagree. A well-optimized libc will benefit _every_single_program_ that uses strlen. That includes Apache, Samba, Memcached, Quake, and basically every single program that every single FreeBSD user uses. There's no reason that 3rd party software maintainers should have to rewrite basic libc functions in order to get decent performance on FreeBSD. And the downsides are so small!

In 2015, we should assume by default that most userland software is using SIMD instructions. As Eric noticed, Clang emits them freely. What's the point to lazily saving the SSE registers on context switches if essentially all programs compiled from Ports will be using those registers anyway? I agree with Jilles; I think we should always save the SSE registers for userland programs.

-Alan
Re: SSE in libthr
On 27 March 2015 at 16:03, Alan Somers asom...@freebsd.org wrote:
> I respectfully disagree. A well-optimized libc will benefit
> _every_single_program_ that uses strlen. That includes Apache, Samba,
> Memcached, Quake, and basically every single program that every single
> FreeBSD user uses. There's no reason that 3rd party software
> maintainers should have to rewrite basic libc functions in order to
> get decent performance on FreeBSD. And the downsides are so small!
>
> In 2015, we should assume by default that most userland software is
> using SIMD instructions. As Eric noticed, Clang emits them freely.
> What's the point to lazily saving the SSE registers on context
> switches if essentially all programs compiled from Ports will be using
> those registers anyway? I agree with Jilles; I think we should always
> save the SSE registers for userland programs.

That's fine, but those benchmarks and improvements also have to take into account the environment that these programs are running in, and all of the other things that are going on with it.

Fixing strlen() to use SSE2 is great, but if the gains are offset by fpu save/restore when doing fine grain locking that's blocking under real world workloads, what's the benefit? What about if the system is context switching over a million times a second? These are real life things I see servers running all of the above software /do/.

One only knows with benchmarking, not microbenchmarking. Microbenchmarks are great. They serve a purpose, which is "how the heck does the current silicon I'm running on run some code that I've cleverly crafted to hopefully run well."

I'm totally for saving/restoring SSE registers for userland programs. But that's not where that kind of "make stuff fast" work should stop. If it does, and that's where your benchmarking for the real world stops, then you're doing it wrong.

Everything is a toss-up. For this userland based netmap packet pushing app, SSE may be nice for some instructions, but know what else screws things? The fact that the default scheduler policy is terrible and crap gets scheduled /everywhere/ under any appreciable amount of load. That the context switch rate is high, the interrupt rate is also high, and with a little locking going on, I see fpu save/restore occur for a non-insignificant fraction of CPU. Optimising strlen() or memcpy() is great, but when my system context switches a million times a second, we're never going to reach the steady state that these CPUs can really crank out real work at under those conditions.

So, cool. Please keep poking at that stuff. But if you stop short of making the system actually /be able to take advantage of them under load/, I respectfully ask for a nice knob I can use to turn them off. :)

-adrian

(Know where the slowdowns for memcached are? Hint - not strlen or memcpy. Yes, I've been down that rabbit hole recently. Know what /i/ have? 1 million UDP transactions a second working on 16 core sandybridge systems. Know what I didn't optimise? memcpy or strlen. The network stack locking and pthreads overhead is what sucks.)
Re: SSE in libthr
Possibly related information.

Recently, I tried to build world/kernel (head, r280410, amd64) with a CPUTYPE setting in make.conf. The real CPU is sandybridge (corei7-avx).

Running in a VirtualBox VM, installworld fails with CPUTYPE?=corei7-avx, while with CPUTYPE?=corei7 everything goes OK.

*Rebooting after installkernel and etcupdate -p goes OK, but rebooting after the failed installworld causes even /bin/sh to fail to start (the kernel starts OK).

Yes, it would be a problem (or limitation) of VirtualBox and NOT of FreeBSD, as a memstick image built from /usr/obj with CPUTYPE?=corei7-avx runs OK on real hardware.

This should mean some AVX instructions are generated by clang 3.6.0 for userland, and VirtualBox doesn't like them.

On Fri, 27 Mar 2015 15:26:17 -0400 Eric van Gyzen vangy...@freebsd.org wrote:
> In a nutshell:
>
> Clang emits SSE instructions on amd64 in the common path of pthread_mutex_unlock. This reduces performance by a non-trivial amount. I'd like to disable SSE in libthr.
>
> In more detail:
>
> In libthr/thread/thr_mutex.c, we find the following:
>
>     #define MUTEX_INIT_LINK(m) do {         \
>             (m)->m_qe.tqe_prev = NULL;      \
>             (m)->m_qe.tqe_next = NULL;      \
>     } while (0)
>
> In 9.1, clang 3.1 emits two ordinary mov instructions:
>
>     movq   $0x0,0x8(%rax)
>     movq   $0x0,(%rax)
>
> Since 10.0 and clang 3.3, clang emits these SSE instructions:
>
>     xorps  %xmm0,%xmm0
>     movups %xmm0,(%rax)
>
> Although these look harmless enough, using the FPU can reduce performance by incurring extra overhead due to context-switching the FPU state.
>
> As I mentioned, this code is used in the common path of pthread_mutex_unlock. I have a simple test program that creates four threads, all contending for a single mutex, and measures the total number of lock acquisitions over several seconds. When libthr is built with SSE, as is current, I get around 53 million locks in 5 seconds. Without SSE, I get around 60 million (13% more). DTrace shows around 790,000 calls to fpudna versus 10 calls.
> There could be other factors involved, but I presume that the FPU context switches account for most of the change in performance.
>
> Even when I add some SSE usage in the application--incidentally, these same instructions--building libthr without SSE improves performance from 53.5 million to 55.8 million (4.3%).
>
> In the real-world application where I first noticed this, performance improves by 3-5%.
>
> I would appreciate your thoughts and feedback. The proposed patch is below.
>
> Eric
>
> Index: base/head/lib/libthr/arch/amd64/Makefile.inc
> ===
> --- base/head/lib/libthr/arch/amd64/Makefile.inc (revision 280703)
> +++ base/head/lib/libthr/arch/amd64/Makefile.inc (working copy)
> @@ -1,3 +1,8 @@
>  #$FreeBSD$
>
>  SRCS+= _umtx_op_err.S
> +
> +# Using SSE incurs extra overhead per context switch,
> +# which measurably impacts performance when the application
> +# does not otherwise use FP/SSE.
> +CFLAGS+=-mno-sse

--
Tomoaki AOKI junch...@dec.sakura.ne.jp
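
For readers who want to try a similar measurement, here is a rough sketch of the kind of contention test Eric describes: several threads fighting over one mutex, with the total number of acquisitions counted over a fixed interval. This is not Eric's actual test program, which was not posted in this message. The names `struct bench`, `worker`, `count_lock_acquisitions` and `MAX_THREADS` are all made up for illustration, and the shared counter is just one plausible way to tally acquisitions.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical sketch; all names here are illustrative assumptions. */
#define MAX_THREADS 64

struct bench {
	pthread_mutex_t	lock;	/* the single contended mutex */
	atomic_int	stop;	/* set to 1 to end the run */
	uint64_t	total;	/* acquisitions so far, protected by lock */
};

static void *
worker(void *arg)
{
	struct bench *b = arg;

	/* Hammer the mutex until the coordinator asks us to stop. */
	while (atomic_load(&b->stop) == 0) {
		pthread_mutex_lock(&b->lock);
		b->total++;
		pthread_mutex_unlock(&b->lock);
	}
	return (NULL);
}

/*
 * Run nthreads workers for roughly `seconds` and return the total
 * number of lock acquisitions. Returns 0 for invalid arguments.
 */
uint64_t
count_lock_acquisitions(int nthreads, double seconds)
{
	struct bench b;
	pthread_t tids[MAX_THREADS];
	struct timespec ts;
	int i, started;

	if (nthreads < 1 || nthreads > MAX_THREADS || seconds < 0)
		return (0);

	pthread_mutex_init(&b.lock, NULL);
	atomic_init(&b.stop, 0);
	b.total = 0;

	/* Start the workers; continue with however many were created. */
	for (started = 0; started < nthreads; started++)
		if (pthread_create(&tids[started], NULL, worker, &b) != 0)
			break;

	/* Let them contend for the requested interval. */
	ts.tv_sec = (time_t)seconds;
	ts.tv_nsec = (long)((seconds - (double)ts.tv_sec) * 1e9);
	nanosleep(&ts, NULL);

	atomic_store(&b.stop, 1);
	for (i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	pthread_mutex_destroy(&b.lock);

	/* Safe to read without the lock: every worker has been joined. */
	return (b.total);
}
```

Calling `count_lock_acquisitions(4, 5.0)` from a small `main` and printing the result would mirror the four-thread, five-second run quoted above. The numbers are only meaningful when compared between two builds of libthr on the same machine, one with and one without `-mno-sse`; absolute values will differ from the figures in this thread.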