Re: svn commit: r290014 - in stable/10: lib/libthr/arch/amd64 lib/libthr/arch/i386 libexec/rtld-elf/amd64 libexec/rtld-elf/i386 share/mk

2015-12-18 Thread Eric van Gyzen
David,

I apologize for the slow reply.  Your message went to my "stable" box,
which I read less often.

On 11/14/2015 12:30, David Chisnall wrote:
> On 26 Oct 2015, at 16:21, Eric van Gyzen wrote:
>> 
>> One counter-argument to this change is that most applications
>> already use SIMD, and the number of applications and amount of SIMD
>> usage are only increasing.
> 
> Note that SSE and SIMD are not the same thing.  The x86-64 ABI uses
> SSE registers for floating point arguments, so even a purely scalar
> application that uses floating point will end up faulting in the SSE
> state.

I'm aware.  Using the term "SIMD" was an admittedly weak attempt to be
platform agnostic.

> I believe that the no-sse option for clang is ABI-preserving, so will
> not actually disable all SSE unless you also specify -msoft-float.

I'm afraid that's not the case:

$ cat square.c
double square(double x) { return (x*x); }

$ clang -mno-sse -c square.c
fatal error: error in backend: SSE register return with SSE disabled
clang: error: clang frontend command failed with exit code 70 (use -v to
see invocation)
FreeBSD clang version 3.7.0 (tags/RELEASE_370/final 246257) 20150906
Target: x86_64-unknown-freebsd11.0
[snip]

Shall I file the bug report, as it suggests?

> I don’t think that libthr uses floating point anywhere, but libc does
> and you only need to call one function that takes a floating point
> argument in between context switches to lose this gain on x86-64.
> With this change, we’re making the compiler emit less efficient code,
> on the assumption that nothing will touch the fpu in the quantum
> before the next context switch.  I’d really like to see the set of
> applications that you benchmarked the change with on x86-64 to reach
> the conclusion that this is a net win overall.
>
> Or, to put it another way: How many applications are multithreaded
> but don’t use any floating point code?

If I showed you the applications that I care about the most, I would
risk losing my job.  When we updated from FreeBSD 9 to 10, we measured a
significant loss in performance.  This was due to multiple factors, one
of which was that clang started using SSE widely.  We were not yet using
that version of clang for our own code, so most of the performance loss
was due to the usage of SSE in libthr.  Using -mno-sse restored the lost
performance.  It's possible that we lost performance due to SSE in other
libraries; I haven't pursued this.

These applications only use floating-point in some rare corners of
management code, not in any performance-sensitive paths.  They also
don't use libc very much.

On a recent head, I used this script

https://people.freebsd.org/~vangyzen/thr_sse/thr_sse_file_line.sh

to generate this list

https://people.freebsd.org/~vangyzen/thr_sse/thr_sse_file_line.txt

of line numbers in libthr that use SSE.  I manually reviewed those to
write this list:

https://people.freebsd.org/~vangyzen/thr_sse/thr_sse_uses.txt

The vast majority of these simply aren't interesting, because they would
not be called in a performance-sensitive code path, or the code that
uses SSE pales in comparison to the weight of the surrounding code.

The only one that I find truly interesting is mutex_unlock_common(),
which uses SSE to NULL two pointers in the "fast path", which is rather
lightweight.  So, I wrote this nanobenchmark

https://people.freebsd.org/~vangyzen/thr_sse/movups/

to measure the effect of using SSE in such a way.  I ran it on five
machines and got these results:

https://people.freebsd.org/~vangyzen/thr_sse/movups/summary.txt

As you can see, most of them show no significant difference.  One
machine, however, showed a 16.7% improvement with SSE.  I find this
fascinating, and I honestly can't explain it.  As always, I welcome
feedback.
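
For readers who have not looked at the code, here is a minimal sketch of the pattern in question.  The struct and function names are hypothetical stand-ins, not the actual libthr definitions; the point is only that two adjacent pointer stores cover 16 contiguous bytes on amd64, which a compiler that is allowed to use SSE may merge into one 16-byte store.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a queue entry holding two adjacent pointers.
 * This is NOT the real libthr structure; it only mimics the layout.
 */
struct qentry {
	struct qentry *next;
	struct qentry **prev;
};

/*
 * Clear both link pointers.  On amd64 the two stores are adjacent, so a
 * compiler permitted to use SSE may emit a single unaligned 16-byte store
 * (for example xorps + movups).  With SSE disabled it has to emit two
 * 8-byte integer stores instead.
 */
void
qentry_clear(struct qentry *e)
{
	assert(e != NULL);
	e->next = NULL;
	e->prev = NULL;
}

/* Return 1 if both links are cleared, 0 otherwise. */
int
qentry_is_clear(const struct qentry *e)
{
	assert(e != NULL);
	return (e->next == NULL && e->prev == NULL);
}
```

Compiling something like this with and without -mno-sse and comparing the output of "cc -O2 -S" should show whether a given compiler merges the stores; that depends on the compiler version and flags.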

I then wrote this /slightly/ more realistic microbenchmark

https://people.freebsd.org/~vangyzen/thr_sse/mutex_bench/

which uses pthread_mutex_unlock and therefore mutex_unlock_common.  I
ran it on /that/ machine.  I got these results:

https://people.freebsd.org/~vangyzen/thr_sse/mutex_bench/summary.txt

When libthr was compiled without SSE, the throughput was improved by
7.25%.  Performance of a real-world application improved 3-5%.
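
As a rough idea of the shape of such a microbenchmark, the sketch below times an uncontended lock/unlock loop.  It is not the mutex_bench code linked above; the function names are made up for illustration, and it measures only the raw cost of the calls.  It will not show the FPU save/restore cost by itself, since that only appears when context switches happen between iterations.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* Current value of the monotonic clock, in nanoseconds. */
static uint64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec);
}

/*
 * Hypothetical helper: lock and unlock an uncontended mutex `iters`
 * times and return the elapsed time in nanoseconds.
 */
uint64_t
bench_mutex(unsigned long iters)
{
	pthread_mutex_t m;
	uint64_t start, end;
	unsigned long i;
	int error;

	error = pthread_mutex_init(&m, NULL);
	assert(error == 0);
	(void)error;

	start = now_ns();
	for (i = 0; i < iters; i++) {
		pthread_mutex_lock(&m);
		pthread_mutex_unlock(&m);
	}
	end = now_ns();

	pthread_mutex_destroy(&m);
	return (end - start);
}
```

Dividing the iteration count by the returned time gives a throughput figure that can be compared between two builds of libthr.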

I honestly don't like the change any more than you do.  I committed it
just because it helped us measurably, it might help others, and I doubt
it hurts anybody.  If that last point is disproven, I'll be happy to
revert it.  Now, I look forward to a lively discussion.  :)

Eric
___
svn-src-all@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"

Re: svn commit: r290014 - in stable/10: lib/libthr/arch/amd64 lib/libthr/arch/i386 libexec/rtld-elf/amd64 libexec/rtld-elf/i386 share/mk

2015-11-15 Thread Konstantin Belousov
On Sat, Nov 14, 2015 at 06:30:13PM +, David Chisnall wrote:
> On 26 Oct 2015, at 16:21, Eric van Gyzen  wrote:
> > 
> > One counter-argument to this change is that most applications already
> >  use SIMD, and the number of applications and amount of SIMD usage
> >  are only increasing.
> 
> Note that SSE and SIMD are not the same thing.  The x86-64 ABI uses SSE 
> registers for floating point arguments, so even a purely scalar application 
> that uses floating point will end up faulting in the SSE state.  This is not 
> the case on IA32, where x87 registers are used (though when compiling for 
> i686, SSE is used by default because register allocation for x87 is a huge 
> pain).
> 

Is it ?  If SSE is used on i686 (AKA >= Pentium Pro) by default,
this is a huge bug.



Re: svn commit: r290014 - in stable/10: lib/libthr/arch/amd64 lib/libthr/arch/i386 libexec/rtld-elf/amd64 libexec/rtld-elf/i386 share/mk

2015-11-15 Thread Bruce Evans

On Sun, 15 Nov 2015, Konstantin Belousov wrote:

> On Sat, Nov 14, 2015 at 06:30:13PM +, David Chisnall wrote:
>> On 26 Oct 2015, at 16:21, Eric van Gyzen wrote:
>>>
>>> One counter-argument to this change is that most applications already
>>> use SIMD, and the number of applications and amount of SIMD usage
>>> are only increasing.
>>
>> Note that SSE and SIMD are not the same thing.  The x86-64 ABI uses SSE
>> registers for floating point arguments, so even a purely scalar application
>> that uses floating point will end up faulting in the SSE state.  This is not
>> the case on IA32, where x87 registers are used (though when compiling for i686,
>> SSE is used by default because register allocation for x87 is a huge pain).
>
> Is it ?  If SSE is used on i686 (AKA >= Pentium Pro) by default,
> this is a huge bug.


clang is not as broken as that.  It needs excessive setting of -march to
get SSE instructions and of course a runtime arch that has SSE to execute
them.  I usually see it by forcing -march=core2 or -march=native on a
host arch that has SSE.

Using SSE instead of x87 on i386 is usually a small pessimization except
in large functions where the x87 register set is too small or non-scalar
SSE can be used, since the i386 ABI requires returning results in x87
registers and the conversions between SSE and x87 for this have large
latency.  But clang doesn't understand the x87 very well, so it tends to
be faster using SSE despite this.  Strangely, it appears to understand the
x87 better on amd64 than on i386 -- better than gcc on amd64, but worse
than gcc on i386.  I think this is mostly because to kill SSE for
arithmetic on i386 on arches that support it, all use of SSE must be
killed using -mno-sse or -march=lower.  -mfpmath=387 to give fine control
of this is still broken in clang: -march=i386 -mfpmath=387 works, but
-march=core2 -mfpmath=387 fails with "'387' not supported".  Similarly for
any arch that supports sse, or with -mfpmath=sse on an arch where clang
wants to use x87.

gcc supports -mfpmath=387 even on amd64.  This is just slower, and usually
not more accurate, since the ABI forces conversions on function return.
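
One way to check which unit a given set of flags actually selects for scalar arithmetic is to look at the compiler's predefined macros.  The sketch below assumes the __SSE_MATH__ and __SSE2_MATH__ macros as predefined by gcc and clang; the function names are made up for illustration, and the "x87" answer is only an inference from the absence of those macros on an x86 target.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/*
 * Report which unit the compiler was told to use for scalar floating
 * point, based on predefined macros.  "x87" is inferred: it means that
 * no SSE math macro is defined on an x86 target.
 */
const char *
fp_math_unit(void)
{
#if defined(__SSE2_MATH__)
	return ("sse2");
#elif defined(__SSE_MATH__)
	return ("sse");
#elif defined(__i386__) || defined(__x86_64__)
	return ("x87");
#else
	return ("other");
#endif
}

/* Return 1 if scalar math goes through SSE, 0 otherwise. */
int
fp_uses_sse_math(void)
{
	const char *unit = fp_math_unit();

	assert(unit != NULL);
	return (unit[0] == 's');
}

/*
 * C99 FLT_EVAL_METHOD: 0 means operations are evaluated in the type's
 * own precision, 2 means in long double precision (typical for x87).
 */
int
fp_eval_method(void)
{
	return ((int)FLT_EVAL_METHOD);
}
```

Building this once with -march=i386 and once with -march=core2 on i386 should show the difference described above, assuming the compiler accepts the flags.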

Bruce


Re: svn commit: r290014 - in stable/10: lib/libthr/arch/amd64 lib/libthr/arch/i386 libexec/rtld-elf/amd64 libexec/rtld-elf/i386 share/mk

2015-11-14 Thread David Chisnall
On 26 Oct 2015, at 16:21, Eric van Gyzen  wrote:
> 
> One counter-argument to this change is that most applications already
>  use SIMD, and the number of applications and amount of SIMD usage
>  are only increasing.

Note that SSE and SIMD are not the same thing.  The x86-64 ABI uses SSE 
registers for floating point arguments, so even a purely scalar application 
that uses floating point will end up faulting in the SSE state.  This is not 
the case on IA32, where x87 registers are used (though when compiling for i686, 
SSE is used by default because register allocation for x87 is a huge pain).

I believe that the no-sse option for clang is ABI-preserving, so will not 
actually disable all SSE unless you also specify -msoft-float.  I don’t think 
that libthr uses floating point anywhere, but libc does and you only need to 
call one function that takes a floating point argument in between context 
switches to lose this gain on x86-64.  With this change, we’re making the 
compiler emit less efficient code, on the assumption that nothing will touch 
the fpu in the quantum before the next context switch.  I’d really like to see 
the set of applications that you benchmarked the change with on x86-64 to reach 
the conclusion that this is a net win overall. 

Or, to put it another way: How many applications are multithreaded but don’t 
use any floating point code?
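
To make the argument concrete, here is a toy model of lazy FPU state switching.  It is not FreeBSD kernel code; the names and the cost accounting are assumptions for illustration.  It assumes the FPU is marked unavailable at the start of each quantum, that the first FPU/SSE instruction traps and loads the state, and that the state is saved at switch-out only if it was loaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy per-thread bookkeeping; a simplification, not the kernel's. */
struct toy_thread {
	bool fpu_loaded;	/* state is live in registers this quantum */
	unsigned long traps;	/* simulated device-not-available faults */
	unsigned long saves;	/* simulated state saves at switch-out */
};

/* Start of a quantum: the FPU is marked unavailable for the thread. */
void
toy_switch_in(struct toy_thread *t)
{
	assert(t != NULL);
	t->fpu_loaded = false;
}

/* One FPU/SSE instruction: only the first per quantum traps. */
void
toy_use_fpu(struct toy_thread *t)
{
	assert(t != NULL);
	if (!t->fpu_loaded) {
		t->traps++;
		t->fpu_loaded = true;
	}
}

/* End of a quantum: the state is saved only if it was loaded. */
void
toy_switch_out(struct toy_thread *t)
{
	assert(t != NULL);
	if (t->fpu_loaded) {
		t->saves++;
		t->fpu_loaded = false;
	}
}

/* Run `quanta` quanta and return the total number of traps plus saves. */
unsigned long
toy_run(unsigned int quanta, unsigned int fpu_ops_per_quantum)
{
	struct toy_thread t = { false, 0, 0 };
	unsigned int q, i;

	for (q = 0; q < quanta; q++) {
		toy_switch_in(&t);
		for (i = 0; i < fpu_ops_per_quantum; i++)
			toy_use_fpu(&t);
		toy_switch_out(&t);
	}
	return (t.traps + t.saves);
}
```

In this model toy_run(10, 0) returns 0, while toy_run(10, 1) and toy_run(10, 100) both return 20: a single floating point operation per quantum costs as much as a hundred.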

David


Re: svn commit: r290014 - in stable/10: lib/libthr/arch/amd64 lib/libthr/arch/i386 libexec/rtld-elf/amd64 libexec/rtld-elf/i386 share/mk

2015-11-12 Thread Bryan Drewery
On 10/26/2015 9:21 AM, Eric van Gyzen wrote:
> Author: vangyzen
> Date: Mon Oct 26 16:21:56 2015
> New Revision: 290014
> URL: https://svnweb.freebsd.org/changeset/base/290014
> 
> Log:
>   Disable SSE in libthr

Please also mention 'MFC rREV'. Thanks!

-- 
Regards,
Bryan Drewery





Re: svn commit: r290014 - in stable/10: lib/libthr/arch/amd64 lib/libthr/arch/i386 libexec/rtld-elf/amd64 libexec/rtld-elf/i386 share/mk

2015-11-12 Thread Eric van Gyzen

On 11/12/15 5:01 PM, Bryan Drewery wrote:
> On 10/26/2015 9:21 AM, Eric van Gyzen wrote:
>> Author: vangyzen
>> Date: Mon Oct 26 16:21:56 2015
>> New Revision: 290014
>> URL: https://svnweb.freebsd.org/changeset/base/290014
>>
>> Log:
>>   Disable SSE in libthr
>
> Please also mention 'MFC rREV'. Thanks!

Will do.  On this commit, I remembered it about 17 ms after
I typed :x.  :-/

Eric


svn commit: r290014 - in stable/10: lib/libthr/arch/amd64 lib/libthr/arch/i386 libexec/rtld-elf/amd64 libexec/rtld-elf/i386 share/mk

2015-10-26 Thread Eric van Gyzen
Author: vangyzen
Date: Mon Oct 26 16:21:56 2015
New Revision: 290014
URL: https://svnweb.freebsd.org/changeset/base/290014

Log:
  Disable SSE in libthr
  
  Clang emits SSE instructions on amd64 in the common path of
  pthread_mutex_unlock.  If the thread does not otherwise use SSE,
  this usage incurs a context-switch of the FPU/SSE state, which
  reduces the performance of multiple real-world applications by a
  non-trivial amount (3-5% in one application).
  
  Instead of this change, I experimented with eagerly switching the
  FPU state at context-switch time.  This did not help.  Most of the
  cost seems to be in the read/write of memory--as kib@ stated--and
  not in the #NM handling.  I tested on machines with and without
  XSAVEOPT.
  
  One counter-argument to this change is that most applications already
  use SIMD, and the number of applications and amount of SIMD usage
  are only increasing.  This is absolutely true.  I agree that--in
  general and in principle--this change is in the wrong direction.
  However, there are applications that do not use enough SSE to offset
  the extra context-switch cost.  SSE does not provide a clear benefit
  in the current libthr code with the current compiler, but it does
  provide a clear loss in some cases.  Therefore, disabling SSE in
  libthr is a non-loss for most, and a gain for some.
  
  I refrained from disabling SSE in libc--as was suggested--because
  I can't make the above argument for libc.  It provides a wide variety
  of code; each case should be analyzed separately.
  
  https://lists.freebsd.org/pipermail/freebsd-current/2015-March/055193.html
  
  Suggestions from:   dim, jmg, rpaulo
  Sponsored by:   Dell Inc.

Modified:
  stable/10/lib/libthr/arch/amd64/Makefile.inc
  stable/10/lib/libthr/arch/i386/Makefile.inc
  stable/10/libexec/rtld-elf/amd64/Makefile.inc
  stable/10/libexec/rtld-elf/i386/Makefile.inc
  stable/10/share/mk/bsd.cpu.mk
Directory Properties:
  stable/10/   (props changed)

Modified: stable/10/lib/libthr/arch/amd64/Makefile.inc
==
--- stable/10/lib/libthr/arch/amd64/Makefile.inc	Mon Oct 26 15:50:39 2015	(r290013)
+++ stable/10/lib/libthr/arch/amd64/Makefile.inc	Mon Oct 26 16:21:56 2015	(r290014)
@@ -1,3 +1,9 @@
 #$FreeBSD$
 
 SRCS+= pthread_md.c _umtx_op_err.S
+
+# With the current compiler and libthr code, using SSE in libthr
+# does not provide enough performance improvement to outweigh
+# the extra context switch cost.  This can measurably impact
+# performance when the application also does not use enough SSE.
+CFLAGS+=${CFLAGS_NO_SIMD}

Modified: stable/10/lib/libthr/arch/i386/Makefile.inc
==
--- stable/10/lib/libthr/arch/i386/Makefile.inc	Mon Oct 26 15:50:39 2015	(r290013)
+++ stable/10/lib/libthr/arch/i386/Makefile.inc	Mon Oct 26 16:21:56 2015	(r290014)
@@ -1,3 +1,9 @@
 # $FreeBSD$
 
 SRCS+= pthread_md.c _umtx_op_err.S
+
+# With the current compiler and libthr code, using SSE in libthr
+# does not provide enough performance improvement to outweigh
+# the extra context switch cost.  This can measurably impact
+# performance when the application also does not use enough SSE.
+CFLAGS+=${CFLAGS_NO_SIMD}

Modified: stable/10/libexec/rtld-elf/amd64/Makefile.inc
==
--- stable/10/libexec/rtld-elf/amd64/Makefile.inc	Mon Oct 26 15:50:39 2015	(r290013)
+++ stable/10/libexec/rtld-elf/amd64/Makefile.inc	Mon Oct 26 16:21:56 2015	(r290014)
@@ -1,6 +1,6 @@
 # $FreeBSD$
 
-CFLAGS+=   -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3 -msoft-float
+CFLAGS+=   ${CFLAGS_NO_SIMD} -msoft-float
 # Uncomment this to build the dynamic linker as an executable instead
 # of a shared library:
 #LDSCRIPT= ${.CURDIR}/${MACHINE_CPUARCH}/elf_rtld.x

Modified: stable/10/libexec/rtld-elf/i386/Makefile.inc
==
--- stable/10/libexec/rtld-elf/i386/Makefile.inc	Mon Oct 26 15:50:39 2015	(r290013)
+++ stable/10/libexec/rtld-elf/i386/Makefile.inc	Mon Oct 26 16:21:56 2015	(r290014)
@@ -1,6 +1,6 @@
 # $FreeBSD$
 
-CFLAGS+=   -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3 -msoft-float
+CFLAGS+=   ${CFLAGS_NO_SIMD} -msoft-float
 # Uncomment this to build the dynamic linker as an executable instead
 # of a shared library:
 #LDSCRIPT= ${.CURDIR}/${MACHINE_CPUARCH}/elf_rtld.x

Modified: stable/10/share/mk/bsd.cpu.mk
==
--- stable/10/share/mk/bsd.cpu.mk	Mon Oct 26 15:50:39 2015	(r290013)
+++ stable/10/share/mk/bsd.cpu.mk	Mon Oct 26 16:21:56 2015	(r290014)
@@ -267,6 +267,27 @@ _CPUCFLAGS += -mfloat-abi=softfp
 CFLAGS += ${_CPUCFLAGS}
 .endif