Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Neal Becker
Charles R Harris wrote:

 On Wed, Jan 8, 2014 at 2:39 PM, Julian Taylor jtaylor.deb...@googlemail.com
 wrote:
 
...
 
 Another function that could be useful is a |a|**2 function, abs2 perhaps.
 
 Chuck

I use mag_sqr all the time.  It should be much faster for complex, if computed
via:

x.real**2 + x.imag**2

avoiding the sqrt that abs performs.
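
(A minimal sketch of such an abs2 in plain numpy -- this is not an
existing or proposed numpy function, just an illustration of the idea:)

import numpy as np

def abs2(x):
    # squared magnitude; for complex input this skips the sqrt that
    # np.abs performs and the squaring that would follow it
    if np.iscomplexobj(x):
        return x.real**2 + x.imag**2
    return x * x

x = np.random.rand(1000) + 1j * np.random.rand(1000)
assert np.allclose(abs2(x), np.abs(x)**2)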



Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Freddie Witherden
On 08/01/14 21:39, Julian Taylor wrote:
 An issue is software emulation of real fma. This can be enabled in the
 test ufunc with npfma.set_type("libc").
 This is unfortunately incredibly slow, about a factor of 300 on my machine
 without hardware fma.
 This means we either have a function that is fast on some platforms and
 slow on others but always gives the same result or we have a fast
 function that gives better results on some platforms.
 Given that we would then be no worse than what numpy currently provides, I
 favor the latter.
 
 Any opinions on whether this should go into numpy or maybe stay a third
 party ufunc?

My preference would be to initially add an madd intrinsic.  This can
be supported on all platforms and can be documented to permit the use of
FMA where available.

A 'true' FMA intrinsic function should only be provided when hardware
FMA support is available.  Many of the more interesting applications of
FMA depend on there only being a single rounding step and as such FMA
should probably mean a*b + c with only a single rounding.
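
(To make the single-rounding point concrete, here is a small
demonstration. It assumes a Python binding to C99 fma -- recent Python
versions expose one as math.fma -- since numpy itself has none:)

import math  # math.fma assumes a recent Python with C99 fma exposed

a = 2.0**27 + 1.0          # exactly representable in double (28 bits)
c = -(2.0**54)

naive = a * a + c          # a*a rounds to 2**54 + 2**28, dropping the +1
fused = math.fma(a, a, c)  # single rounding keeps the low bit

print(naive)  # 268435456.0 == 2**28
print(fused)  # 268435457.0 == 2**28 + 1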

Regards, Freddie.





Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Frédéric Bastien
Hi,

It happens frequently that NumPy isn't compiled with all the instructions
available on the machine where it runs, for example in distros. So if the
decision is made to use the fast version when the newer instructions aren't
used, the user needs a way to know that; the library needs a
function/attribute to report it.

How hard would it be to provide the choice to the user? We could
provide two functions, e.g. fma_fast() and fma_prec() (for precision). Or
this could be a parameter or a user configuration option, as for the
overflow/underflow errors.
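
(A sketch of what such a pair could look like -- hypothetical names,
nothing here exists in numpy; fma_fast is just the unfused in-place
version:)

import numpy as np

def fma_fast(a, b, c, out=None):
    # hypothetical: fastest variant on this cpu, no rounding guarantee
    out = np.multiply(a, b, out=out)
    out += c
    return out

def fma_prec(a, b, c):
    # hypothetical: would guarantee a single rounding (true fma),
    # via hardware fma or libc fma; placeholder signature only
    raise NotImplementedError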

Fred

On Thu, Jan 9, 2014 at 9:43 AM, Freddie Witherden fred...@witherden.org wrote:
 On 08/01/14 21:39, Julian Taylor wrote:
 An issue is software emulation of real fma. This can be enabled in the
 test ufunc with npfma.set_type("libc").
 This is unfortunately incredibly slow, about a factor of 300 on my machine
 without hardware fma.
 This means we either have a function that is fast on some platforms and
 slow on others but always gives the same result or we have a fast
 function that gives better results on some platforms.
 Given that we would then be no worse than what numpy currently provides, I
 favor the latter.

 Any opinions on whether this should go into numpy or maybe stay a third
 party ufunc?

 My preference would be to initially add an madd intrinsic.  This can
 be supported on all platforms and can be documented to permit the use of
 FMA where available.

 A 'true' FMA intrinsic function should only be provided when hardware
 FMA support is available.  Many of the more interesting applications of
 FMA depend on there only being a single rounding step and as such FMA
 should probably mean a*b + c with only a single rounding.

 Regards, Freddie.





Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Daπid
On 8 January 2014 22:39, Julian Taylor jtaylor.deb...@googlemail.com wrote:

 As you can see even without real hardware support it is about 30% faster
 than inplace unblocked numpy due to better use of memory bandwidth. It's
 even more than two times faster than unoptimized numpy.


I have an i5, and AVX crashes, even though it is supported by my CPU.
Here are my timings:

SSE2:

In [24]: %timeit npfma.fma(a, b, c)
10 loops, best of 3: 15 us per loop

In [28]: %timeit npfma.fma(a, b, c)
100 loops, best of 3: 2.36 ms per loop

In [29]: %timeit npfma.fms(a, b, c)
100 loops, best of 3: 2.36 ms per loop

In [31]: %timeit pure_numpy_fma(a, b, c)
100 loops, best of 3: 7.5 ms per loop

In [33]: %timeit pure_numpy_fma2(a, b, c)
100 loops, best of 3: 4.41 ms per loop

The model supports everything up to sse4_2.

libc:

In [24]: %timeit npfma.fma(a, b, c)
1000 loops, best of 3: 883 us per loop

In [28]: %timeit npfma.fma(a, b, c)
10 loops, best of 3: 88.7 ms per loop

In [29]: %timeit npfma.fms(a, b, c)
10 loops, best of 3: 87.4 ms per loop

In [31]: %timeit pure_numpy_fma(a, b, c)
100 loops, best of 3: 7.94 ms per loop

In [33]: %timeit pure_numpy_fma2(a, b, c)
100 loops, best of 3: 3.03 ms per loop
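
(For reference, the pure-numpy baselines timed above are presumably
along these lines -- a guess, the real definitions are in Julian's test
code:)

import numpy as np

def pure_numpy_fma(a, b, c):
    # naive: materializes the a*b temporary, then another for the sum
    return a * b + c

def pure_numpy_fma2(a, b, c):
    # in-place: one temporary fewer, better memory bandwidth use
    r = a * b
    r += c
    return r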



 If you have a machine capable of fma instructions give it a spin to see
 if you get similar or better results. Please verify the assembly
 (objdump -d fma-suffix.o) to check if the compiler properly used the
 machine fma.


Following the instructions in the readme, there is only one compiled file,
npfma.so, but no .o.


/David.


Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Julian Taylor
On Thu, Jan 9, 2014 at 3:54 PM, Daπid davidmen...@gmail.com wrote:


 On 8 January 2014 22:39, Julian Taylor jtaylor.deb...@googlemail.com wrote:

 As you can see even without real hardware support it is about 30% faster
 than inplace unblocked numpy due to better use of memory bandwidth. It's
 even more than two times faster than unoptimized numpy.


 I have an i5, and AVX crashes, even though it is supported by my CPU.


I forgot about the 32-byte alignment that avx (as used in this code)
requires. I pushed a new version that takes care of it.
It should now work with avx.
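
(For anyone reproducing this: a common trick to get 32-byte-aligned
test arrays in pure numpy is the following sketch -- not part of npfma:)

import numpy as np

def aligned_empty(shape, dtype=np.float64, align=32):
    # over-allocate a byte buffer, then slice at an aligned offset
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    buf = np.empty(nbytes + align, dtype=np.uint8)
    offset = (-buf.ctypes.data) % align
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)

a = aligned_empty((10000,))
assert a.ctypes.data % 32 == 0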


 Following the instructions in the readme, there is only one compiled file,
 npfma.so, but no .o.


The .o files are in the build/ subfolder.


Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Julian Taylor
On Thu, Jan 9, 2014 at 3:50 PM, Frédéric Bastien no...@nouiz.org wrote:

 Hi,

 It happens frequently that NumPy isn't compiled with all the instructions
 available on the machine where it runs, for example in distros. So if the
 decision is made to use the fast version when the newer instructions aren't
 used, the user needs a way to know that; the library needs a
 function/attribute to report it.


As these instructions are very new, runtime cpu feature detection is
required. That way distribution users also get the fast code if their cpu
supports it.
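
(For illustration, a crude Linux-only check from Python -- not a
proposed numpy API:)

def cpu_has(flag):
    # parse /proc/cpuinfo for a cpuid feature flag such as 'fma'
    with open('/proc/cpuinfo') as f:
        for line in f:
            if line.startswith('flags'):
                return flag in line.split()
    return False

print(cpu_has('fma'))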




 How hard would it be to provide the choice to the user? We could
 provide two functions, e.g. fma_fast() and fma_prec() (for precision). Or
 this could be a parameter or a user configuration option, as for the
 overflow/underflow errors.


I like Freddie Witherden's proposal to name the function madd, which does
not guarantee a single rounding. This leaves the namespace open for a
special fma function with that guarantee. It can use the libc fma function,
which is sometimes very slow but platform independent. This assumes apple
did not again take shortcuts like they did with their libc hypot
implementation; can someone disassemble apple's libc to check what they are
doing for C99 fma?
It also leaves users the possibility of using the faster madd function if
they do not need the precision guarantee.

Another option would be a precision context manager which tells numpy which
variant to use. This would also be useful for other code (like
abs/hypot/abs2/sum/reciprocal sqrt), but it probably involves more work.

with numpy.precision_mode('fast'):
    ...  # allow no fma, use fast hypot, fast sum,
         # ignore overflow/invalid errors

with numpy.precision_mode('precise'):
    ...  # require fma, use precise hypot, use exact summation (math.fsum)
         # or at least kahan summation, full overflow/invalid checks etc.
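
(A rough sketch of how such a context manager could look -- purely
hypothetical, numpy has no precision_mode; the ufunc machinery would
have to consult the global it sets:)

import contextlib

_mode = 'precise'  # hypothetical global read by the ufunc loops

@contextlib.contextmanager
def precision_mode(mode):
    global _mode
    if mode not in ('fast', 'precise'):
        raise ValueError(mode)
    old, _mode = _mode, mode
    try:
        yield
    finally:
        _mode = old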





 On Thu, Jan 9, 2014 at 9:43 AM, Freddie Witherden fred...@witherden.org
 wrote:
  On 08/01/14 21:39, Julian Taylor wrote:
  An issue is software emulation of real fma. This can be enabled in the
  test ufunc with npfma.set_type("libc").
  This is unfortunately incredibly slow, about a factor of 300 on my machine
  without hardware fma.
  This means we either have a function that is fast on some platforms and
  slow on others but always gives the same result or we have a fast
  function that gives better results on some platforms.
  Given that we would then be no worse than what numpy currently provides, I
  favor the latter.
 
  Any opinions on whether this should go into numpy or maybe stay a third
  party ufunc?
 
  My preference would be to initially add an madd intrinsic.  This can
  be supported on all platforms and can be documented to permit the use of
  FMA where available.
 
  A 'true' FMA intrinsic function should only be provided when hardware
  FMA support is available.  Many of the more interesting applications of
  FMA depend on there only being a single rounding step and as such FMA
  should probably mean a*b + c with only a single rounding.
 
  Regards, Freddie.
 
 
 



Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Nathaniel Smith
On Thu, Jan 9, 2014 at 3:30 PM, Julian Taylor
jtaylor.deb...@googlemail.com wrote:
 On Thu, Jan 9, 2014 at 3:50 PM, Frédéric Bastien no...@nouiz.org wrote:
 How hard would it be to provide the choice to the user? We could
 provide two functions, e.g. fma_fast() and fma_prec() (for precision). Or
 this could be a parameter or a user configuration option, as for the
 overflow/underflow errors.

 I like Freddie Witherden's proposal to name the function madd, which does
 not guarantee a single rounding. This leaves the namespace open for a
 special fma function with that guarantee. It can use the libc fma function,
 which is sometimes very slow but platform independent. This assumes apple
 did not again take shortcuts like they did with their libc hypot
 implementation; can someone disassemble apple's libc to check what they are
 doing for C99 fma?
 It also leaves users the possibility of using the faster madd function if
 they do not need the precision guarantee.

If madd doesn't provide any rounding guarantees, then its only reason
for existence is that it provides a fused a*b+c loop that better
utilizes memory bandwidth, right? I'm guessing that speed-wise it
doesn't really matter whether you use the fancy AVX instructions or
not, since even the naive implementation is memory bound -- the
advantage is just in the fusion?

Lack of loop fusion is obviously a major limitation of numpy, but it's
a very general problem. I'm sceptical about whether we want to get
into the business of adding functions whose only purpose is to provide
pre-fused loops. After madd, what other operations should we provide
like this? msub (a*b-c)? add3 (a+b+c)? maddm (a*b+c*d)? mult3 (a*b*c)?
How do we decide? Surely it's better to direct people who are hitting
memory bottlenecks to much more powerful and general solutions to this
problem, like numexpr/cython/numba/theano?
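
(numexpr, for instance, already gives the fused evaluation: it compiles
the expression and processes cache-sized blocks without full-size
temporaries:)

import numpy as np
import numexpr as ne

a, b, c = (np.random.rand(10**7) for _ in range(3))
r = ne.evaluate('a * b + c')  # one pass, no full-size temporaries
assert np.allclose(r, a * b + c)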

(OTOH the version that gives rounding guarantees is obviously a unique
new feature.)

-n


Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Frédéric Bastien
Good question: where do we stop.

I think, like you, that fma with guarantees is a good new feature. But
if it is made available, people will want to use it for speed. Some
people won't want to use another library or dependency, and they won't
like random speed-ups or slow-downs. So why not add madd and fma, and
draw the line at operations for which the CPU implements a fused
version? That makes a sensible limit, I think.

Anyway, we won't use it directly. This is just my thought.

Do you know if those instructions are used automatically by gcc if we
pass the right architecture parameters?


Fred

On Thu, Jan 9, 2014 at 12:07 PM, Nathaniel Smith n...@pobox.com wrote:
 On Thu, Jan 9, 2014 at 3:30 PM, Julian Taylor
 jtaylor.deb...@googlemail.com wrote:
 On Thu, Jan 9, 2014 at 3:50 PM, Frédéric Bastien no...@nouiz.org wrote:
 How hard would it be to provide the choice to the user? We could
 provide two functions, e.g. fma_fast() and fma_prec() (for precision). Or
 this could be a parameter or a user configuration option, as for the
 overflow/underflow errors.

 I like Freddie Witherden's proposal to name the function madd, which does
 not guarantee a single rounding. This leaves the namespace open for a
 special fma function with that guarantee. It can use the libc fma function,
 which is sometimes very slow but platform independent. This assumes apple
 did not again take shortcuts like they did with their libc hypot
 implementation; can someone disassemble apple's libc to check what they are
 doing for C99 fma?
 It also leaves users the possibility of using the faster madd function if
 they do not need the precision guarantee.

 If madd doesn't provide any rounding guarantees, then its only reason
 for existence is that it provides a fused a*b+c loop that better
 utilizes memory bandwidth, right? I'm guessing that speed-wise it
 doesn't really matter whether you use the fancy AVX instructions or
 not, since even the naive implementation is memory bound -- the
 advantage is just in the fusion?

 Lack of loop fusion is obviously a major limitation of numpy, but it's
 a very general problem. I'm sceptical about whether we want to get
 into the business of adding functions whose only purpose is to provide
 pre-fused loops. After madd, what other operations should we provide
 like this? msub (a*b-c)? add3 (a+b+c)? maddm (a*b+c*d)? mult3 (a*b*c)?
 How do we decide? Surely it's better to direct people who are hitting
 memory bottlenecks to much more powerful and general solutions to this
 problem, like numexpr/cython/numba/theano?

 (OTOH the version that gives rounding guarantees is obviously a unique
 new feature.)

 -n


Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Julian Taylor
On 10.01.2014 01:49, Frédéric Bastien wrote:
 
 Do you know if those instructions are used automatically by gcc if we
 pass the right architecture parameters?
 
 

They are used if you enable -ffp-contract=fast. Do not set it to `on`:
due to the semantics of C, that is an alias for `off`.
-ffast-math enables it in gcc 4.7 and 4.8 but not in 4.9; this might
be a bug, I filed one a while ago.

Also you need to set -mfma or -march=bdver{1,2,3,4}. It's not part of
-mavx2, last I checked.

But there are not many places in numpy where the compiler can use it; only
dot comes to mind, and that goes through blas libraries in the
high-performance case.