Re: [Info], Add support for PowerPC IEEE 128-bit floating point

2014-07-17 Thread Michael Meissner
On Tue, Jul 15, 2014 at 04:50:33PM -0500, Segher Boessenkool wrote:
 On Tue, Jul 15, 2014 at 05:20:31PM -0400, Michael Meissner wrote:
  I did some timing tests to compare the new PowerPC IEEE 128-bit results to the
  current implementation of long double using the IBM extended format.
  
  The test consisted of a short loop doing the operation over arrays of 1,024
  elements, reading in two values, doing the operation, and then storing it back.
  This loop in turn was done multiple times, with the idea that most of the
  values would be in the cache, and we didn't have to worry about pre-fetching,
  etc.
  
  The float and double tests were done with vectorization disabled, while for
  the vector float and vector double tests the compiler was allowed to do its
  normal auto-vectorization.
  
  The number reported was how much longer the second column took over the first:
 
 I assume you mean the other way around?
 
  Generally, the __float128 is 2x slower than the current IBM extended double
  format, except for divide, where it is 5x slower.  I must say, the software
  floating point emulation routines worked well, and once the proper macros were
  set up, I only needed to override the type used for IEEE 128-bit.
  
  Add loop
  ========
  
  float       vs double:         2.00x
 
 Why is float twice as slow as double?

Pat re-ran the tests, and now float/double are the same speed.  Since I was
running this on a development machine, and not a dedicated machine, it was
probably just luck of the draw that somebody was doing a large build at the
time I ran the tests.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797



Re: [Info], Add support for PowerPC IEEE 128-bit floating point

2014-07-16 Thread Michael Meissner
On Tue, Jul 15, 2014 at 04:50:33PM -0500, Segher Boessenkool wrote:
 On Tue, Jul 15, 2014 at 05:20:31PM -0400, Michael Meissner wrote:
  I did some timing tests to compare the new PowerPC IEEE 128-bit results to the
  current implementation of long double using the IBM extended format.
  
  The test consisted of a short loop doing the operation over arrays of 1,024
  elements, reading in two values, doing the operation, and then storing it back.
  This loop in turn was done multiple times, with the idea that most of the
  values would be in the cache, and we didn't have to worry about pre-fetching,
  etc.
  
  The float and double tests were done with vectorization disabled, while for
  the vector float and vector double tests the compiler was allowed to do its
  normal auto-vectorization.
  
  The number reported was how much longer the second column took over the first:
 
 I assume you mean the other way around?
 
  Generally, the __float128 is 2x slower than the current IBM extended double
  format, except for divide, where it is 5x slower.  I must say, the software
  floating point emulation routines worked well, and once the proper macros were
  set up, I only needed to override the type used for IEEE 128-bit.
  
  Add loop
  ========
  
  float       vs double:         2.00x
 
 Why is float twice as slow as double?

I'm not sure.  Book IV claims that lfsu vs. lfdu, fadds vs. fadd, and stfsu
vs. stfdu each take the same number of cycles.  The inner loop is fairly
simple:

;; Float loop

        .p2align 6,,63
.L460:
        lfsu 12,4(10)
        lfsu 0,4(8)
        fadds 0,12,0
        stfsu 0,4(9)
        bdnz .L460

;; Double loop

        .p2align 6,,63
.L466:
        lfdu 12,8(10)
        lfdu 0,8(8)
        fadd 0,12,0
        stfdu 0,8(9)
        bdnz .L466

I suspect the slowdown in a tight loop comes from the fact that the PowerPC
internally keeps scalar floating point values in double format in the
registers (i.e. lfsu must load the value and convert it to double, fadds must
do the add and then round to float precision, and stfsu must convert the double
to float format).  We've also seen cases where a load with update is slower
than doing the load and the address update as separate instructions.
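
As a rough C model of the conversions described above (not the hardware's
actual data path), the sketch below checks that widening a float to double and
back is lossless, and that a double add rounded to float matches a
single-precision add.  It assumes IEEE 754 arithmetic with round-to-nearest;
the function names are hypothetical.

```c
#include <assert.h>

/* Model of "load float, keep it in double format, store it as float":
   every float value is exactly representable as a double, so the
   round trip should give back the original value. */
static int roundtrip_is_exact(float x)
{
    assert(x == x);                     /* NaN inputs are out of scope */
    volatile double wide = (double)x;   /* what lfsu does conceptually */
    volatile float back = (float)wide;  /* what stfsu does conceptually */
    return back == x;
}

/* Model of fadds: add in double format, then round to float precision.
   Under the assumptions above this should match a plain float add. */
static int add_matches(float x, float y)
{
    assert(x == x && y == y);           /* NaN inputs are out of scope */
    volatile float single = x + y;
    volatile double wide = (double)x + (double)y;
    volatile float rounded = (float)wide;
    return single == rounded;
}
```

On such a system, both roundtrip_is_exact(0.1f) and add_matches(0.1f, 0.2f)
should return 1.  The conversions cost nothing in accuracy, so any penalty
would show up only in cycles.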




Re: [Info], Add support for PowerPC IEEE 128-bit floating point

2014-07-15 Thread Michael Meissner
I did some timing tests to compare the new PowerPC IEEE 128-bit results to the
current implementation of long double using the IBM extended format.

The test consisted of a short loop doing the operation over arrays of 1,024
elements, reading in two values, doing the operation, and then storing the
result back.
This loop in turn was done multiple times, with the idea that most of the
values would be in the cache, and we didn't have to worry about pre-fetching,
etc.
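
A minimal sketch of the kind of loop described above, for the add case.  The
actual test source was not posted in this thread, so the names (N, TYPE,
add_loop, time_add_loop), the use of clock(), and the GCC-style compiler
barrier are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define N 1024                 /* elements per array, as in the description */

/* Hypothetical knob: build with -DTYPE=float, -DTYPE="long double",
   -DTYPE=__float128, and so on. */
#ifndef TYPE
#define TYPE double
#endif

static TYPE a[N], b[N], c[N];

/* One pass: read two values, do the operation, store the result. */
static void add_loop(void)
{
    for (size_t i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

/* Repeat the pass `reps` times so the arrays stay cache-resident, and
   return the elapsed CPU time in seconds. */
static double time_add_loop(long reps)
{
    assert(reps > 0);
    for (size_t i = 0; i < N; i++) {
        a[i] = (TYPE)i;
        b[i] = (TYPE)(2 * i);
    }
    clock_t start = clock();
    for (long r = 0; r < reps; r++) {
        add_loop();
        /* Compiler barrier (GCC extension) so that the repeated passes
           are not optimized away. */
        __asm__ volatile("" : : : "memory");
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Dividing the time for one type by the time for another gives a ratio of the
kind reported below.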

The float and double tests were done with vectorization disabled, while for
the vector float and vector double tests the compiler was allowed to do its
normal auto-vectorization.

The number reported was how much longer the second column took over the first:

Generally, the __float128 is 2x slower than the current IBM extended double
format, except for divide, where it is 5x slower.  I must say, the software
floating point emulation routines worked well, and once the proper macros were
set up, I only needed to override the type used for IEEE 128-bit.

Add loop
========

float       vs double:         2.00x
float       vs vector float:   4.97x
double      vs vector double:  2.63x
long double vs double:        16.85x
__float128  vs double:        23.34x
__float128  vs long double:    1.39x

Subtract loop
=============

float       vs double:         1.99x
float       vs vector float:   4.66x
double      vs vector double:  2.63x
long double vs double:        14.47x
__float128  vs double:        27.65x
__float128  vs long double:    1.91x

Multiply loop
=============

float       vs double:         2.05x
float       vs vector float:   5.18x
double      vs vector double:  2.59x
long double vs double:        11.58x
__float128  vs double:        27.44x
__float128  vs long double:    2.37x

Divide loop
===========

float       vs double:         0.82x
float       vs vector float:   2.11x
double      vs vector double:  2.00x
long double vs double:         5.90x
__float128  vs double:        34.57x
__float128  vs long double:    5.86x

Maximum via comparison and ?:
=============================

float       vs double:         1.74x
float       vs vector float:   4.62x
double      vs vector double:  2.62x
long double vs double:         5.07x
__float128  vs double:        18.02x
__float128  vs long double:    3.55x

Minimum via comparison and ?:
=============================

float       vs double:         1.74x
float       vs vector float:   4.52x
double      vs vector double:  2.62x
long double vs double:         5.38x
__float128  vs double:        15.14x
__float128  vs long double:    2.82x
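
As a consistency check on the tables above, the "__float128 vs long double"
figure in each table should be roughly the "__float128 vs double" figure
divided by the "long double vs double" figure, since both are measured against
the same double baseline.  The sketch below does that arithmetic; the struct
and function names are hypothetical, and the numbers are copied from the
tables.

```c
#include <assert.h>

struct row {
    const char *op;
    double ld_vs_double;     /* long double vs double */
    double f128_vs_double;   /* __float128  vs double */
    double f128_vs_ld;       /* __float128  vs long double, as reported */
};

static const struct row rows[] = {
    { "add",      16.85, 23.34, 1.39 },
    { "subtract", 14.47, 27.65, 1.91 },
    { "multiply", 11.58, 27.44, 2.37 },
    { "divide",    5.90, 34.57, 5.86 },
    { "maximum",   5.07, 18.02, 3.55 },
    { "minimum",   5.38, 15.14, 2.82 },
};

/* Slowdown of __float128 relative to long double, derived from the two
   ratios that share double as the baseline. */
static double derived_ratio(const struct row *r)
{
    assert(r->ld_vs_double > 0.0);
    return r->f128_vs_double / r->ld_vs_double;
}
```

For the add loop, 23.34 / 16.85 is about 1.385, in line with the reported
1.39x; the other five rows agree to within rounding as well.  This only works
if each figure is the time for the first type divided by the time for the
second.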






Re: [Info], Add support for PowerPC IEEE 128-bit floating point

2014-07-15 Thread Segher Boessenkool
On Tue, Jul 15, 2014 at 05:20:31PM -0400, Michael Meissner wrote:
 I did some timing tests to compare the new PowerPC IEEE 128-bit results to the
 current implementation of long double using the IBM extended format.
 
 The test consisted of a short loop doing the operation over arrays of 1,024
 elements, reading in two values, doing the operation, and then storing it back.
 This loop in turn was done multiple times, with the idea that most of the
 values would be in the cache, and we didn't have to worry about pre-fetching,
 etc.
 
 The float and double tests were done with vectorization disabled, while for
 the vector float and vector double tests the compiler was allowed to do its
 normal auto-vectorization.
 
 The number reported was how much longer the second column took over the first:

I assume you mean the other way around?

 Generally, the __float128 is 2x slower than the current IBM extended double
 format, except for divide, where it is 5x slower.  I must say, the software
 floating point emulation routines worked well, and once the proper macros were
 set up, I only needed to override the type used for IEEE 128-bit.
 
 Add loop
 ========
 
 float       vs double:         2.00x

Why is float twice as slow as double?


Segher