Re: SSE in libthr

2015-04-14 Thread Eric van Gyzen
Below is an updated patch to incorporate everyone's feedback so far.

I recognize all of the counter-arguments, and I agree with them in general.
Indeed, as applications use more SIMD, this kind of patch goes in the wrong
direction.  However, there are applications that do not use enough SSE to offset
the extra context-switch cost.  SSE does not provide a clear benefit in the
current libthr code with the current compiler, but it does provide a clear loss
in some cases.  Therefore, disabling SSE in libthr is a non-loss for most, and a
gain for some.

I refrained from disabling SSE in libc--as was suggested--because I can't make
the above argument for libc.  It provides such a variety of code that SSE might
be a net win in some cases.  I wish I had time to identify and benchmark the
interesting cases.

Thanks in advance for your further review and comments.

Eric



Index: head/lib/libthr/arch/amd64/Makefile.inc
===
--- head/lib/libthr/arch/amd64/Makefile.inc (revision 281473)
+++ head/lib/libthr/arch/amd64/Makefile.inc (working copy)
@@ -1,3 +1,9 @@
 #$FreeBSD$

 SRCS+= _umtx_op_err.S
+
+# With the current compiler and libthr code, using SSE in libthr
+# does not provide enough performance improvement to outweigh
+# the extra context switch cost.  This can measurably impact
+# performance when the application also does not use enough SSE.
+CFLAGS+=${CFLAGS_NO_SIMD}
Index: head/lib/libthr/arch/i386/Makefile.inc
===
--- head/lib/libthr/arch/i386/Makefile.inc  (revision 281473)
+++ head/lib/libthr/arch/i386/Makefile.inc  (working copy)
@@ -1,3 +1,9 @@
 # $FreeBSD$

 SRCS+= _umtx_op_err.S
+
+# With the current compiler and libthr code, using SSE in libthr
+# does not provide enough performance improvement to outweigh
+# the extra context switch cost.  This can measurably impact
+# performance when the application also does not use enough SSE.
+CFLAGS+=${CFLAGS_NO_SIMD}
Index: head/libexec/rtld-elf/amd64/Makefile.inc
===
--- head/libexec/rtld-elf/amd64/Makefile.inc (revision 281473)
+++ head/libexec/rtld-elf/amd64/Makefile.inc (working copy)
@@ -1,6 +1,6 @@
 # $FreeBSD$

-CFLAGS+=   -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3 -msoft-float
+CFLAGS+=   ${CFLAGS_NO_SIMD} -msoft-float
 # Uncomment this to build the dynamic linker as an executable instead
 # of a shared library:
 #LDSCRIPT= ${.CURDIR}/${MACHINE_CPUARCH}/elf_rtld.x
Index: head/libexec/rtld-elf/i386/Makefile.inc
===
--- head/libexec/rtld-elf/i386/Makefile.inc (revision 281473)
+++ head/libexec/rtld-elf/i386/Makefile.inc (working copy)
@@ -1,6 +1,6 @@
 # $FreeBSD$

-CFLAGS+=   -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3 -msoft-float
+CFLAGS+=   ${CFLAGS_NO_SIMD} -msoft-float
 # Uncomment this to build the dynamic linker as an executable instead
 # of a shared library:
 #LDSCRIPT= ${.CURDIR}/${MACHINE_CPUARCH}/elf_rtld.x
Index: head/share/mk/bsd.sys.mk
===
--- head/share/mk/bsd.sys.mk (revision 281473)
+++ head/share/mk/bsd.sys.mk (working copy)
@@ -153,6 +153,26 @@ SSP_CFLAGS?=   -fstack-protector
 CFLAGS+=   ${SSP_CFLAGS}
 .endif # SSP  !ARM  !MIPS

+#
+# Prohibit the compiler from emitting SIMD instructions.
+# These flags are added to CFLAGS in areas where the extra context-switch
+# cost outweighs the advantages of SIMD instructions.
+#
+# gcc:
+# Setting -mno-mmx implies -mno-3dnow
+# Setting -mno-sse implies -mno-sse2, -mno-sse3, -mno-ssse3 and -mfpmath=387
+#
+# clang:
+# Setting -mno-mmx implies -mno-3dnow and -mno-3dnowa
+# Setting -mno-sse implies -mno-sse2, -mno-sse3, -mno-ssse3, -mno-sse41 and
+# -mno-sse42
+# (-mfpmath= is not supported)
+#
+.if ${MACHINE_CPUARCH} == i386 || ${MACHINE_CPUARCH} == amd64
+CFLAGS_NO_SIMD.clang=  -mno-avx
+CFLAGS_NO_SIMD=-mno-mmx -mno-sse ${CFLAGS_NO_SIMD.${COMPILER_TYPE}}
+.endif
+
 # Allow user-specified additional warning flags, plus compiler specific flag overrides.
 # Unless we've overriden this...
 .if ${MK_WARNS} != no
Index: head/sys/conf/kern.mk
===
--- head/sys/conf/kern.mk   (revision 281473)
+++ head/sys/conf/kern.mk   (working copy)
@@ -75,18 +75,10 @@ FORMAT_EXTENSIONS=  -fformat-extensions
 # operations inside the kernel itself.  These operations are exclusively
 # reserved for user applications.
 #
-# gcc:
-# Setting -mno-mmx implies -mno-3dnow
-# Setting -mno-sse implies -mno-sse2, -mno-sse3 and -mno-ssse3
-#
-# clang:
-# Setting -mno-mmx implies -mno-3dnow and -mno-3dnowa
-# Setting -mno-sse implies -mno-sse2, -mno-sse3, -mno-ssse3, -mno-sse41 and
-# -mno-sse42
-#
 .if ${MACHINE_CPUARCH} == i386
 CFLAGS.gcc+=   -mno-align-long

Re: SSE in libthr

2015-04-06 Thread John Baldwin
On Saturday, March 28, 2015 10:41:48 AM Adrian Chadd wrote:
 Ok, so how do we reduce the amount of FPU save and restores, or make
 them cheaper?

Or make them more useful.  If you are using SSE/AVX more often between context
switches in ways that are beneficial then that might offset the cost of the
save and restore and result in a net win.  I have variants of strlen, memcpy,
and memset that use SSE.  However, microbenchmarks aren't super useful as you
have noted.  If you would like to try these out in some real workloads I can
provide a patch to libc.

-- 
John Baldwin


Re: SSE in libthr

2015-03-28 Thread Julian Elischer

On 3/28/15 5:44 AM, Konstantin Belousov wrote:

On Fri, Mar 27, 2015 at 01:49:03PM -0700, Rui Paulo wrote:

On Mar 27, 2015, at 12:26, Eric van Gyzen vangy...@freebsd.org wrote:

In a nutshell:

Clang emits SSE instructions on amd64 in the common path of
pthread_mutex_unlock.  This reduces performance by a non-trivial amount.  I'd
like to disable SSE in libthr.

In more detail:

In libthr/thread/thr_mutex.c, we find the following:

#define MUTEX_INIT_LINK(m)  do {\
(m)->m_qe.tqe_prev = NULL;  \
(m)->m_qe.tqe_next = NULL;  \
} while (0)

In 9.1, clang 3.1 emits two ordinary mov instructions:

movq   $0x0,0x8(%rax)
movq   $0x0,(%rax)

Since 10.0 and clang 3.3, clang emits these SSE instructions:

xorps  %xmm0,%xmm0
movups %xmm0,(%rax)

Although these look harmless enough, using the FPU can reduce performance by
incurring extra overhead due to context-switching the FPU state.

As I mentioned, this code is used in the common path of pthread_mutex_unlock.  I
have a simple test program that creates four threads, all contending for a
single mutex, and measures the total number of lock acquisitions over several
seconds.  When libthr is built with SSE, as is current, I get around 53 million
locks in 5 seconds.  Without SSE, I get around 60 million (13% more).  DTrace
shows around 790,000 calls to fpudna versus 10 calls.  There could be other
factors involved, but I presume that the FPU context switches account for most
of the change in performance.

Even when I add some SSE usage in the application--incidentally, these same
instructions--building libthr without SSE improves performance from 53.5 million
to 55.8 million (4.3%).

In the real-world application where I first noticed this, performance improves
by 3-5%.

I would appreciate your thoughts and feedback.  The proposed patch is below.

Eric



Index: base/head/lib/libthr/arch/amd64/Makefile.inc
===
--- base/head/lib/libthr/arch/amd64/Makefile.inc (revision 280703)
+++ base/head/lib/libthr/arch/amd64/Makefile.inc (working copy)
@@ -1,3 +1,8 @@
#$FreeBSD$

SRCS+=  _umtx_op_err.S
+
+# Using SSE incurs extra overhead per context switch,
+# which measurably impacts performance when the application
+# does not otherwise use FP/SSE.
+CFLAGS+=-mno-sse

Good catch!

Regarding your patch, I think we should disable even more, if possible.  How 
about:

CFLAGS+=-mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3

I think so.

Also, this should be done for libc as well, both on i386 and amd64.
I am not sure; should compiler-rt be included in the set?
the point is that clang will do this anywhere it can, because it isn't
taking into account the side effects, just the speed of the commands
themselves.




Re: SSE in libthr

2015-03-28 Thread David Chisnall
On 28 Mar 2015, at 13:54, Julian Elischer jul...@freebsd.org wrote:
 
 the point is that clang will do this anywhere it can, because it isn't taking 
 into account the
 side effects, just the speed of the commands themselves.

This is also something that is not going to decrease.  Clang now enables the 
SLP vectoriser by default and this code is constantly being improved.  Current 
generation vector units are explicitly designed as targets for compiler 
autovectorisation, not for hand-tuned DSP code (which, increasingly, runs on 
the GPU anyway).  This means that we're increasingly going to see SSE/AVX/NEON 
usage in CPU-bound code, even without an explicit programmer decision to do so. 
 Optimising for the case when the vector unit is not used is about as sensible 
as optimising for the single-core case: it will affect some people, but 
generally not those who care about performance, and a decreasing number of 
people over time.

David



Re: SSE in libthr

2015-03-28 Thread John-Mark Gurney
Eric van Gyzen wrote this message on Fri, Mar 27, 2015 at 17:43 -0400:
 On 03/27/2015 16:49, Rui Paulo wrote:
 
  Regarding your patch, I think we should disable even more, if possible.  
  How about:
 
  CFLAGS+=-mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3
 
 Yes, I was considering copying all of the similar flags that we use in the
 kernel.  That seems wise.  According to comments in sys/conf/kern.mk, only
 no-mmx and no-sse would be necessary, as they imply the others.
 
 dim@ raised the possibility of CPUTYPE=foo on i386, so I would also apply this
 change to i386.
 
 An updated patch is below.

We should probably add a $(CFLAGS_NOFPU) define and use that.. Then it
can be properly tweaked per compiler and per arch as necessary instead
of hardcoding the selection in each makefile...

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 All that I will do, has been done, All that I have, has not.


Re: SSE in libthr

2015-03-28 Thread Adrian Chadd
Ok, so how do we reduce the amount of FPU save and restores, or make
them cheaper?



-a


Re: SSE in libthr

2015-03-28 Thread Tomoaki AOKI
If SIMD instructions are used for string processing, and FPU (AVX)
contexts are NOT saved/restored properly on process (thread) switching,
the string being processed could be corrupted by another process (thread).
Couldn't that be a security risk? (Broken string parameters for syscalls, etc.)

If so, FPU (AVX) contexts should be saved/restored at least on process
(thread) switching.

 *If SIMD instructions are NOT used in kernel and kernel modules at all,
  there would be no need for saving/restoring FPU contexts on
  interrupts.

It's not limited to system libraries.  As Alan noted, third-party
applications can use their own string-processing code that uses SIMD.


On Fri, 27 Mar 2015 17:43:14 -0700
Adrian Chadd adr...@freebsd.org wrote:

 On 27 March 2015 at 16:03, Alan Somers asom...@freebsd.org wrote:
  On Fri, Mar 27, 2015 at 4:36 PM, Adrian Chadd adr...@freebsd.org wrote:
  hi,
 
  please don't try to microoptimise crap like strlen().
 
  The TL;DR for performant high-throughput code is: if strlen() or
  memcpy() is the thing that's costing you the most, you're doing it
  wrong.
 
 
 
  -adrian
 
  I respectfully disagree.  A well-optimized libc will benefit
  _every_single_program_ that uses strlen.  That includes Apache, Samba,
  Memcached, Quake, and basically every single program that every single
  FreeBSD user uses.  There's no reason that 3rd party software
  maintainers should have to rewrite basic libc functions in order to
  get decent performance on FreeBSD.  And the downsides are so small!
  In 2015, we should assume by default that most userland software is
  using SIMD instructions.  As Eric noticed, Clang emits them freely.
  What's the point to lazily saving the SSE registers on context
  switches if essentially all programs compiled from Ports will be using
  those registers anyway?  I agree with Jilles; I think we should always
  save the SSE registers for userland programs.
 
 That's fine, but those benchmarks and improvements also have to take
 into account the environment that these programs are running in, and
 all of the other things that are going on with it.
 
 Fixing strlen() to use SSE2 is great, but if the gains are offset by
 fpu save/restore when doing fine grain locking that's blocking under
 real world workloads, what's the benefit? What about if the system is
 context switching over a million times a second? These are real life
 things I see servers running all of the above software /do/.
 
 One only knows with benchmarking, not microbenchmarking.
 
 Microbenchmarks are great. They serve a purpose, which is how the
 heck is the current silicon I'm running on run some code that I've
 cleverly crafted to hopefully run well.
 
 I'm totally for saving/restoring SSE registers for userland programs.
 But that's not where that kind of "make stuff fast" work should stop.
 If it does, and that's where your benchmarking for the real world
 stops, then you're doing it wrong.
 
 Everything is a toss-up. For this userland based netmap packet pushing
 app, SSE may be nice for some instructions, but know what else screws
 things? The fact that the default scheduler policy is terrible and
 crap gets scheduled /everywhere/ under any appreciable amount of load.
 That the context switch rate is high, the interrupt rate is also high,
 and with a little locking going on, I see fpu save/restore occur for a
 non-insignificant fraction of CPU. Optimising strlen() or memcpy() is
 great, but when my system context switches a million times a second,
 we're never going to reach the steady state that these CPUs can really
 crank out real work at under those conditions.
 
 So, cool. Please keep poking at that stuff. But if you stop short of
 making the system actually /be able to take advantage of them under
 load/, I respectfully ask for a nice knob I can use to turn them off.
 :)
 
 
 
 -adrian
 
 (Know where the slowdowns for memcached are? Hint - not strlen or
 memcpy. Yes, I've been down that rabbit hole recently. Know what /i/
 have? 1 million UDP transactions a second working on 16 core
 sandybridge systems. Know what I didn't optimise? memcpy or strlen.
 The network stack locking and pthreads overhead is what sucks.)
 


-- 
青木 知明  [Tomoaki AOKI]
junch...@dec.sakura.ne.jp
mxe02...@nifty.com


Re: SSE in libthr

2015-03-28 Thread Konstantin Belousov
On Fri, Mar 27, 2015 at 10:40:57PM +0100, Jilles Tjoelker wrote:
 On Fri, Mar 27, 2015 at 03:26:17PM -0400, Eric van Gyzen wrote:
  In a nutshell:
 
  Clang emits SSE instructions on amd64 in the common path of
  pthread_mutex_unlock.  This reduces performance by a non-trivial
  amount.  I'd like to disable SSE in libthr.
 
 How about saving and restoring the FPU/SSE state eagerly instead of the
 current CR0.TS-based lazy method? There is overhead associated with #NM
 exception handling (fpudna) which is not worth it if FPU/SSE are used
 often. This would apply to userland threads only; kernel threads
 normally do not use FPU/SSE and handle the FPU/SSE state manually if
 they do.
First, we have no choice but to save the FPU context when a thread is
switched out.  It is not practical to try to keep the state in the
hardware, since fetching it from another core is too troublesome.

Second, the biggest overhead of #NM is the reading of FPU context from
memory (or cache), not the handler itself.  The save area for SSE-capable
machines, i.e. all amd64, is ~400 bytes, and XSAVEOPT does not help
much for reading of legacy FPU + XMM state.  It does help for YMM.

That said, your proposal would force all threads to pay a higher cost at
context-switch time, increasing latency.

 
 There is performance improvement potential in using SSE for optimizing
 string functions, for example. Even a simple SSE2 strlen easily
 outperforms the already optimized lib/libc/string/strlen.c in a
 microbenchmark, and many other string functions are slow byte-at-a-time
 implementations.

If the program does a lot of work with the FPU between switches, the cost
is obviously mitigated.  Note that even for the worst case
of the reported microbenchmark, the measured overhead is ~10-15%.
So if string ops indeed take a significant share of the program time,
the FPU #NM handling cost should be very low even with the current
scheme.


SSE in libthr

2015-03-27 Thread Eric van Gyzen
In a nutshell:

Clang emits SSE instructions on amd64 in the common path of
pthread_mutex_unlock.  This reduces performance by a non-trivial amount.  I'd
like to disable SSE in libthr.

In more detail:

In libthr/thread/thr_mutex.c, we find the following:

#define MUTEX_INIT_LINK(m)  do {\
(m)->m_qe.tqe_prev = NULL;  \
(m)->m_qe.tqe_next = NULL;  \
} while (0)

In 9.1, clang 3.1 emits two ordinary mov instructions:

movq   $0x0,0x8(%rax)
movq   $0x0,(%rax)

Since 10.0 and clang 3.3, clang emits these SSE instructions:

xorps  %xmm0,%xmm0
movups %xmm0,(%rax)

Although these look harmless enough, using the FPU can reduce performance by
incurring extra overhead due to context-switching the FPU state.
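
To see the codegen outside of libthr, a stand-alone snippet along these
lines reproduces it (a hypothetical illustration; the struct and function
names are made up and only mimic what MUTEX_INIT_LINK does).  Build it once
with "cc -O2 -S reproduce.c" and once more with "-mno-sse" added, then
compare the stores in the generated assembly:

#include <stddef.h>

/*
 * Hypothetical reproducer, not libthr source: zero two adjacent pointers,
 * just as MUTEX_INIT_LINK does for the queue entry of a mutex.
 */
struct qe {
	struct qe	*tqe_next;	/* stands in for m_qe.tqe_next */
	struct qe	**tqe_prev;	/* stands in for m_qe.tqe_prev */
};

void
init_link(struct qe *m)
{
	/* With SSE enabled, clang merges these stores into xorps + movups. */
	m->tqe_prev = NULL;
	m->tqe_next = NULL;
}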

As I mentioned, this code is used in the common path of pthread_mutex_unlock.  I
have a simple test program that creates four threads, all contending for a
single mutex, and measures the total number of lock acquisitions over several
seconds.  When libthr is built with SSE, as is current, I get around 53 million
locks in 5 seconds.  Without SSE, I get around 60 million (13% more).  DTrace
shows around 790,000 calls to fpudna versus 10 calls.  There could be other
factors involved, but I presume that the FPU context switches account for most
of the change in performance.
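
For reference, a minimal sketch of this kind of contention benchmark (an
illustrative stand-in, not the exact test program; the thread count and
duration simply mirror the numbers above) could look like this:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define	NTHREADS	4
#define	SECONDS		5

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_long count;	/* total lock acquisitions */
static atomic_int done;		/* set to 1 to stop the workers */

static void *
worker(void *arg)
{
	long local = 0;

	(void)arg;
	while (!atomic_load_explicit(&done, memory_order_relaxed)) {
		pthread_mutex_lock(&lock);
		pthread_mutex_unlock(&lock);
		local++;
	}
	atomic_fetch_add(&count, local);
	return (NULL);
}

int
main(void)
{
	pthread_t threads[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	sleep(SECONDS);
	atomic_store(&done, 1);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	printf("%ld lock acquisitions in %d seconds\n",
	    atomic_load(&count), SECONDS);
	return (0);
}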

Even when I add some SSE usage in the application--incidentally, these same
instructions--building libthr without SSE improves performance from 53.5 million
to 55.8 million (4.3%).

In the real-world application where I first noticed this, performance improves
by 3-5%.

I would appreciate your thoughts and feedback.  The proposed patch is below.

Eric



Index: base/head/lib/libthr/arch/amd64/Makefile.inc
===
--- base/head/lib/libthr/arch/amd64/Makefile.inc (revision 280703)
+++ base/head/lib/libthr/arch/amd64/Makefile.inc (working copy)
@@ -1,3 +1,8 @@
 #$FreeBSD$

 SRCS+= _umtx_op_err.S
+
+# Using SSE incurs extra overhead per context switch,
+# which measurably impacts performance when the application
+# does not otherwise use FP/SSE.
+CFLAGS+=-mno-sse


Re: SSE in libthr

2015-03-27 Thread Adrian Chadd
Wow. I remember seeing this in the work application - all packet
pushing in userland, but there are locks being acquired. I was
wondering what exactly was triggering the FPU save/restore code. Now I
know.

Yes, if there are no other objections, I'd love to see this in -HEAD
and stable/10.


-adrian


Re: SSE in libthr

2015-03-27 Thread Daniel Eischen

On Fri, 27 Mar 2015, Eric van Gyzen wrote:


In a nutshell:

Clang emits SSE instructions on amd64 in the common path of
pthread_mutex_unlock.  This reduces performance by a non-trivial amount.  I'd
like to disable SSE in libthr.


This makes sense to me.

--
DE


Re: SSE in libthr

2015-03-27 Thread Rui Paulo
On Mar 27, 2015, at 12:26, Eric van Gyzen vangy...@freebsd.org wrote:
 
 In a nutshell:
 
 Clang emits SSE instructions on amd64 in the common path of
 pthread_mutex_unlock.  This reduces performance by a non-trivial amount.  I'd
 like to disable SSE in libthr.
 
 In more detail:
 
 In libthr/thread/thr_mutex.c, we find the following:
 
   #define MUTEX_INIT_LINK(m)  do {\
   (m)->m_qe.tqe_prev = NULL;  \
   (m)->m_qe.tqe_next = NULL;  \
   } while (0)
 
 In 9.1, clang 3.1 emits two ordinary mov instructions:
 
   movq   $0x0,0x8(%rax)
   movq   $0x0,(%rax)
 
 Since 10.0 and clang 3.3, clang emits these SSE instructions:
 
   xorps  %xmm0,%xmm0
   movups %xmm0,(%rax)
 
 Although these look harmless enough, using the FPU can reduce performance by
 incurring extra overhead due to context-switching the FPU state.
 
 As I mentioned, this code is used in the common path of pthread_mutex_unlock. 
  I
 have a simple test program that creates four threads, all contending for a
 single mutex, and measures the total number of lock acquisitions over several
 seconds.  When libthr is built with SSE, as is current, I get around 53 
 million
 locks in 5 seconds.  Without SSE, I get around 60 million (13% more).  DTrace
 shows around 790,000 calls to fpudna versus 10 calls.  There could be other
 factors involved, but I presume that the FPU context switches account for most
 of the change in performance.
 
 Even when I add some SSE usage in the application--incidentally, these same
 instructions--building libthr without SSE improves performance from 53.5 
 million
 to 55.8 million (4.3%).
 
 In the real-world application where I first noticed this, performance improves
 by 3-5%.
 
 I would appreciate your thoughts and feedback.  The proposed patch is below.
 
 Eric
 
 
 
 Index: base/head/lib/libthr/arch/amd64/Makefile.inc
 ===
 --- base/head/lib/libthr/arch/amd64/Makefile.inc  (revision 280703)
 +++ base/head/lib/libthr/arch/amd64/Makefile.inc  (working copy)
 @@ -1,3 +1,8 @@
 #$FreeBSD$
 
 SRCS+=_umtx_op_err.S
 +
 +# Using SSE incurs extra overhead per context switch,
 +# which measurably impacts performance when the application
 +# does not otherwise use FP/SSE.
 +CFLAGS+=-mno-sse

Good catch!

Regarding your patch, I think we should disable even more, if possible.  How 
about:

CFLAGS+=-mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3

--
Rui Paulo





Re: SSE in libthr

2015-03-27 Thread Jilles Tjoelker
On Fri, Mar 27, 2015 at 03:26:17PM -0400, Eric van Gyzen wrote:
 In a nutshell:

 Clang emits SSE instructions on amd64 in the common path of
 pthread_mutex_unlock.  This reduces performance by a non-trivial
 amount.  I'd like to disable SSE in libthr.

How about saving and restoring the FPU/SSE state eagerly instead of the
current CR0.TS-based lazy method? There is overhead associated with #NM
exception handling (fpudna) which is not worth it if FPU/SSE are used
often. This would apply to userland threads only; kernel threads
normally do not use FPU/SSE and handle the FPU/SSE state manually if
they do.

There is performance improvement potential in using SSE for optimizing
string functions, for example. Even a simple SSE2 strlen easily
outperforms the already optimized lib/libc/string/strlen.c in a
microbenchmark, and many other string functions are slow byte-at-a-time
implementations.
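
As a rough illustration of what "a simple SSE2 strlen" can mean, a sketch
along these lines (written for the sake of discussion only, not proposed
libc code) is already enough to beat a byte-at-a-time loop in a
microbenchmark:

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t
sse2_strlen(const char *s)
{
	const __m128i zero = _mm_setzero_si128();
	const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
	unsigned mask;

	/*
	 * Aligned 16-byte loads never cross a page boundary, so reading
	 * a few bytes before s, or past the terminating NUL, is safe.
	 */
	mask = _mm_movemask_epi8(
	    _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
	mask >>= (uintptr_t)s & 15;	/* drop bytes before s */
	if (mask != 0)
		return ((size_t)__builtin_ctz(mask));

	for (;;) {
		p += 16;
		mask = _mm_movemask_epi8(
		    _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
		if (mask != 0)
			return ((size_t)(p - s) + __builtin_ctz(mask));
	}
}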

-- 
Jilles Tjoelker


Re: SSE in libthr

2015-03-27 Thread Eric van Gyzen
On 03/27/2015 16:49, Rui Paulo wrote:

 Regarding your patch, I think we should disable even more, if possible.  How 
 about:

 CFLAGS+=-mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3

Yes, I was considering copying all of the similar flags that we use in the
kernel.  That seems wise.  According to comments in sys/conf/kern.mk, only
no-mmx and no-sse would be necessary, as they imply the others.

dim@ raised the possibility of CPUTYPE=foo on i386, so I would also apply this
change to i386.

An updated patch is below.

Eric


Index: base/head/lib/libthr/arch/amd64/Makefile.inc
===
--- base/head/lib/libthr/arch/amd64/Makefile.inc (revision 280703)
+++ base/head/lib/libthr/arch/amd64/Makefile.inc (working copy)
@@ -1,3 +1,8 @@
 #$FreeBSD$
 
 SRCS+=_umtx_op_err.S
+
+# Using SSE incurs extra overhead per context switch,
+# which measurably impacts performance when the application
+# does not otherwise use FP/SSE.
+CFLAGS+=-mno-sse -mno-mmx
Index: base/head/lib/libthr/arch/i386/Makefile.inc
===
--- base/head/lib/libthr/arch/i386/Makefile.inc (revision 280703)
+++ base/head/lib/libthr/arch/i386/Makefile.inc (working copy)
@@ -1,3 +1,8 @@
 # $FreeBSD$
 
 SRCS+=_umtx_op_err.S
+
+# Using SSE incurs extra overhead per context switch,
+# which measurably impacts performance when the application
+# does not otherwise use FP/SSE.
+CFLAGS+=-mno-sse -mno-mmx



Re: SSE in libthr

2015-03-27 Thread Konstantin Belousov
On Fri, Mar 27, 2015 at 01:49:03PM -0700, Rui Paulo wrote:
 On Mar 27, 2015, at 12:26, Eric van Gyzen vangy...@freebsd.org wrote:
  
  In a nutshell:
  
  Clang emits SSE instructions on amd64 in the common path of
  pthread_mutex_unlock.  This reduces performance by a non-trivial amount.  
  I'd
  like to disable SSE in libthr.
  
  In more detail:
  
  In libthr/thread/thr_mutex.c, we find the following:
  
  #define MUTEX_INIT_LINK(m)  do {\
  (m)->m_qe.tqe_prev = NULL;  \
  (m)->m_qe.tqe_next = NULL;  \
  } while (0)
  
  In 9.1, clang 3.1 emits two ordinary mov instructions:
  
  movq   $0x0,0x8(%rax)
  movq   $0x0,(%rax)
  
  Since 10.0 and clang 3.3, clang emits these SSE instructions:
  
  xorps  %xmm0,%xmm0
  movups %xmm0,(%rax)
  
  Although these look harmless enough, using the FPU can reduce performance by
  incurring extra overhead due to context-switching the FPU state.
  
  As I mentioned, this code is used in the common path of 
  pthread_mutex_unlock.  I
  have a simple test program that creates four threads, all contending for a
  single mutex, and measures the total number of lock acquisitions over 
  several
  seconds.  When libthr is built with SSE, as is current, I get around 53 
  million
  locks in 5 seconds.  Without SSE, I get around 60 million (13% more).  
  DTrace
  shows around 790,000 calls to fpudna versus 10 calls.  There could be other
  factors involved, but I presume that the FPU context switches account for 
  most
  of the change in performance.
  
  Even when I add some SSE usage in the application--incidentally, these same
  instructions--building libthr without SSE improves performance from 53.5 
  million
  to 55.8 million (4.3%).
  
  In the real-world application where I first noticed this, performance 
  improves
  by 3-5%.
  
  I would appreciate your thoughts and feedback.  The proposed patch is below.
  
  Eric
  
  
  
  Index: base/head/lib/libthr/arch/amd64/Makefile.inc
  ===
--- base/head/lib/libthr/arch/amd64/Makefile.inc (revision 280703)
+++ base/head/lib/libthr/arch/amd64/Makefile.inc (working copy)
  @@ -1,3 +1,8 @@
  #$FreeBSD$
  
  SRCS+=  _umtx_op_err.S
  +
  +# Using SSE incurs extra overhead per context switch,
  +# which measurably impacts performance when the application
  +# does not otherwise use FP/SSE.
  +CFLAGS+=-mno-sse
 
 Good catch!
 
 Regarding your patch, I think we should disable even more, if possible.  How 
 about:
 
 CFLAGS+=-mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3

I think so.

Also, this should be done for libc as well, both on i386 and amd64.
I am not sure; should compiler-rt be included in the set?


Re: SSE in libthr

2015-03-27 Thread Adrian Chadd
hi,

please don't try to microoptimise crap like strlen().

The TL;DR for performant high-throughput code is: if strlen() or
memcpy() is the thing that's costing you the most, you're doing it
wrong.



-adrian


Re: SSE in libthr

2015-03-27 Thread Alan Somers
On Fri, Mar 27, 2015 at 4:36 PM, Adrian Chadd adr...@freebsd.org wrote:
 hi,

 please don't try to microoptimise crap like strlen().

 The TL;DR for performant high-throughput code is: if strlen() or
 memcpy() is the thing that's costing you the most, you're doing it
 wrong.



 -adrian

I respectfully disagree.  A well-optimized libc will benefit
_every_single_program_ that uses strlen.  That includes Apache, Samba,
Memcached, Quake, and basically every single program that every single
FreeBSD user uses.  There's no reason that 3rd party software
maintainers should have to rewrite basic libc functions in order to
get decent performance on FreeBSD.  And the downsides are so small!
In 2015, we should assume by default that most userland software is
using SIMD instructions.  As Eric noticed, Clang emits them freely.
What's the point to lazily saving the SSE registers on context
switches if essentially all programs compiled from Ports will be using
those registers anyway?  I agree with Jilles; I think we should always
save the SSE registers for userland programs.

-Alan


Re: SSE in libthr

2015-03-27 Thread Adrian Chadd
On 27 March 2015 at 16:03, Alan Somers asom...@freebsd.org wrote:
 On Fri, Mar 27, 2015 at 4:36 PM, Adrian Chadd adr...@freebsd.org wrote:
 hi,

 please don't try to microoptimise crap like strlen().

 The TL;DR for performant high-throughput code is: if strlen() or
 memcpy() is the thing that's costing you the most, you're doing it
 wrong.



 -adrian

 I respectfully disagree.  A well-optimized libc will benefit
 _every_single_program_ that uses strlen.  That includes Apache, Samba,
 Memcached, Quake, and basically every single program that every single
 FreeBSD user uses.  There's no reason that 3rd party software
 maintainers should have to rewrite basic libc functions in order to
 get decent performance on FreeBSD.  And the downsides are so small!
 In 2015, we should assume by default that most userland software is
 using SIMD instructions.  As Eric noticed, Clang emits them freely.
 What's the point to lazily saving the SSE registers on context
 switches if essentially all programs compiled from Ports will be using
 those registers anyway?  I agree with Jilles; I think we should always
 save the SSE registers for userland programs.

That's fine, but those benchmarks and improvements also have to take
into account the environment that these programs are running in, and
all of the other things that are going on with it.

Fixing strlen() to use SSE2 is great, but if the gains are offset by
fpu save/restore when doing fine grain locking that's blocking under
real world workloads, what's the benefit? What about if the system is
context switching over a million times a second? These are real life
things I see servers running all of the above software /do/.

One only knows with benchmarking, not microbenchmarking.

Microbenchmarks are great. They serve a purpose, which is how the
heck is the current silicon I'm running on run some code that I've
cleverly crafted to hopefully run well.

I'm totally for saving/restoring SSE registers for userland programs.
But that's not where that kind of "make stuff fast" work should stop.
If it does, and that's where your benchmarking for the real world
stops, then you're doing it wrong.

Everything is a toss-up. For this userland based netmap packet pushing
app, SSE may be nice for some instructions, but know what else screws
things? The fact that the default scheduler policy is terrible and
crap gets scheduled /everywhere/ under any appreciable amount of load.
That the context switch rate is high, the interrupt rate is also high,
and with a little locking going on, I see fpu save/restore occur for a
non-insignificant fraction of CPU. Optimising strlen() or memcpy() is
great, but when my system context switches a million times a second,
we're never going to reach the steady state that these CPUs can really
crank out real work at under those conditions.

So, cool. Please keep poking at that stuff. But if you stop short of
making the system actually /be able to take advantage of them under
load/, I respectfully ask for a nice knob I can use to turn them off.
:)



-adrian

(Know where the slowdowns for memcached are? Hint - not strlen or
memcpy. Yes, I've been down that rabbit hole recently. Know what /i/
have? 1 million UDP transactions a second working on 16 core
sandybridge systems. Know what I didn't optimise? memcpy or strlen.
The network stack locking and pthreads overhead is what sucks.)


Re: SSE in libthr

2015-03-27 Thread Tomoaki AOKI
Possibly related information.

Recently, I tried to build world/kernel (head, r280410, amd64) with
CPUTYPE setting in make.conf.  Real CPU is sandybridge (corei7-avx).

Running in a VirtualBox VM, installworld fails with CPUTYPE?=corei7-avx,
while with CPUTYPE?=corei7 everything goes OK.
 *Rebooting after installkernel and etcupdate -p goes OK, but rebooting
  after failed installworld causes even /bin/sh fail to start (kernel
  starts OK).

Yes, it would be a problem (or limitation) of VirtualBox and NOT of
FreeBSD, as a memstick image built from /usr/obj with CPUTYPE?=corei7-avx
runs OK on real hardware.  This should mean some AVX instructions are
generated by clang 3.6.0 for userland, and VirtualBox doesn't like them.


On Fri, 27 Mar 2015 15:26:17 -0400
Eric van Gyzen vangy...@freebsd.org wrote:

 In a nutshell:
 
 Clang emits SSE instructions on amd64 in the common path of
 pthread_mutex_unlock.  This reduces performance by a non-trivial amount.  I'd
 like to disable SSE in libthr.
 
 In more detail:
 
 In libthr/thread/thr_mutex.c, we find the following:
 
   #define MUTEX_INIT_LINK(m)  do {\
   (m)->m_qe.tqe_prev = NULL;  \
   (m)->m_qe.tqe_next = NULL;  \
   } while (0)
 
 In 9.1, clang 3.1 emits two ordinary mov instructions:
 
   movq   $0x0,0x8(%rax)
   movq   $0x0,(%rax)
 
 Since 10.0 and clang 3.3, clang emits these SSE instructions:
 
   xorps  %xmm0,%xmm0
   movups %xmm0,(%rax)
 
 Although these look harmless enough, using the FPU can reduce performance by
 incurring extra overhead due to context-switching the FPU state.
 
 As I mentioned, this code is used in the common path of pthread_mutex_unlock. 
  I
 have a simple test program that creates four threads, all contending for a
 single mutex, and measures the total number of lock acquisitions over several
 seconds.  When libthr is built with SSE, as is current, I get around 53 
 million
 locks in 5 seconds.  Without SSE, I get around 60 million (13% more).  DTrace
 shows around 790,000 calls to fpudna versus 10 calls.  There could be other
 factors involved, but I presume that the FPU context switches account for most
 of the change in performance.
 
 Even when I add some SSE usage in the application--incidentally, these same
 instructions--building libthr without SSE improves performance from 53.5 
 million
 to 55.8 million (4.3%).
 
 In the real-world application where I first noticed this, performance improves
 by 3-5%.
 
 I would appreciate your thoughts and feedback.  The proposed patch is below.
 
 Eric
 
 
 
 Index: base/head/lib/libthr/arch/amd64/Makefile.inc
 ===
 --- base/head/lib/libthr/arch/amd64/Makefile.inc  (revision 280703)
 +++ base/head/lib/libthr/arch/amd64/Makefile.inc  (working copy)
 @@ -1,3 +1,8 @@
  #$FreeBSD$
 
  SRCS+=   _umtx_op_err.S
 +
 +# Using SSE incurs extra overhead per context switch,
 +# which measurably impacts performance when the application
 +# does not otherwise use FP/SSE.
 +CFLAGS+=-mno-sse
 


-- 
Tomoaki AOKI  junch...@dec.sakura.ne.jp