Re: [julia-users] Is FMA/Muladd Working Here?

2016-09-23 Thread Erik Schnetter
It should. Yes, please open an issue.

-erik

On Thu, Sep 22, 2016 at 7:46 PM, Chris Rackauckas 
wrote:

> So, in the end, is `@fastmath` supposed to be adding FMA? Should I open an
> issue?
>
> On Wednesday, September 21, 2016 at 7:11:14 PM UTC-7, Yichao Yu wrote:
>>
>> On Wed, Sep 21, 2016 at 9:49 PM, Erik Schnetter 
>> wrote:
>> > I confirm that I can't get Julia to synthesize a `vfmadd` instruction
>> > either... Sorry for sending you on a wild goose chase.
>>
>> -march=haswell does the trick for C (both clang and gcc)
>> the necessary bit for the machine ir optimization (this is not a llvm
>> ir optimization pass) to do this is llc options -mcpu=haswell and
>> function attribute unsafe-fp-math=true.
>>
>> >
>> > -erik
>> >
>> > On Wed, Sep 21, 2016 at 9:33 PM, Yichao Yu  wrote:
>> >>
>> >> On Wed, Sep 21, 2016 at 9:29 PM, Erik Schnetter 
>> >> wrote:
>> >> > On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas 
>>
>> >> > wrote:
>> >> >>
>> >> >> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg
>> and
>> >> >> now
>> >> >> I get results where g and h apply muladd/fma in the native code,
>> but a
>> >> >> new
>> >> >> function k which is `@fastmath` inside of f does not apply
>> muladd/fma.
>> >> >>
>> >> >>
>> >> >> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
>> >> >>
>> >> >> Should I open an issue?
>> >> >
>> >> >
>> >> > In your case, LLVM apparently thinks that `x + x + 3` is faster to
>> >> > calculate
>> >> > than `2x+3`. If you use a less round number than `2` multiplying
>> `x`,
>> >> > you
>> >> > might see a different behaviour.
>> >>
>> >> I've personally never seen llvm create fma from mul and add. We might
>> >> not have the llvm passes enabled if LLVM is capable of doing this at
>> >> all.
>> >>
>> >> >
>> >> > -erik
>> >> >
>> >> >
>> >> >> Note that this is on v0.6 Windows. On Linux the sysimg isn't
>> rebuilding
>> >> >> for some reason, so I may need to just build from source.
>> >> >>
>> >> >> On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik
>> Schnetter
>> >> >> wrote:
>> >> >>>
>> >> >>> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas <
>> rack...@gmail.com>
>> >> >>> wrote:
>> >> 
>> >>  Hi,
>> >>    First of all, does LLVM essentially fma or muladd expressions
>> like
>> >>  `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one
>> >>  explicitly use
>> >>  `muladd` and `fma` on these types of instructions (is there a
>> macro
>> >>  for
>> >>  making this easier)?
>> >> >>>
>> >> >>>
>> >> >>> Yes, LLVM will use fma machine instructions -- but only if they
>> lead
>> >> >>> to
>> >> >>> the same round-off error as using separate multiply and add
>> >> >>> instructions. If
>> >> >>> you do not care about the details of conforming to the IEEE
>> standard,
>> >> >>> then
>> >> >>> you can use the `@fastmath` macro that enables several
>> optimizations,
>> >> >>> including this one. This is described in the manual
>> >> >>>
>> >> >>> <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
>> >> >>>
>> >> >>>
>> >>    Secondly, I am wondering if my setup is not applying these
>> >>  operations
>> >>  correctly. Here's my test code:
>> >> 
>> >>  f(x) = 2.0x + 3.0
>> >>  g(x) = muladd(x,2.0, 3.0)
>> >>  h(x) = fma(x,2.0, 3.0)
>> >> 
>> >>  @code_llvm f(4.0)
>> >>  @code_llvm g(4.0)
>> >>  @code_llvm h(4.0)
>> >> 
>> >>  @code_native f(4.0)
>> >>  @code_native g(4.0)
>> >>  @code_native h(4.0)
>> >> 
>> >>  Computer 1
>> >> 
>> >>  Julia Version 0.5.0-rc4+0
>> >>  Commit 9c76c3e* (2016-09-09 01:43 UTC)
>> >>  Platform Info:
>> >>    System: Linux (x86_64-redhat-linux)
>> >>    CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>> >>    WORD_SIZE: 64
>> >>    BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>> >>    LAPACK: libopenblasp.so.0
>> >>    LIBM: libopenlibm
>> >>    LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>> >> >>>
>> >> >>>
>> >> >>> This looks good, the "broadwell" architecture that LLVM uses
>> should
>> >> >>> imply
>> >> >>> the respective optimizations. Try with `@fastmath`.
>> >> >>>
>> >> >>> -erik
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> 
>> >>  (the COPR nightly on CentOS7) with
>> >> 
>> >>  [crackauc@crackauc2 ~]$ lscpu
>> >>  Architecture:  x86_64
>> >>  CPU op-mode(s):32-bit, 64-bit
>> >>  Byte Order:Little Endian
>> >>  CPU(s):16
>> >>  On-line CPU(s) list:   0-15
>> >>  Thread(s) per core:1
>> >>  Core(s) per socket:8
>> >>  Socket(s): 2
>> >>  NUMA node(s):  2
>> >>  Vendor ID: GenuineIntel
>> >>  CPU family:6
>> >>  Model: 79
>> >>  Model 

Re: [julia-users] Is FMA/Muladd Working Here?

2016-09-22 Thread Chris Rackauckas
So, in the end, is `@fastmath` supposed to be adding FMA? Should I open an 
issue?

On Wednesday, September 21, 2016 at 7:11:14 PM UTC-7, Yichao Yu wrote:
>
> On Wed, Sep 21, 2016 at 9:49 PM, Erik Schnetter  > wrote: 
> > I confirm that I can't get Julia to synthesize a `vfmadd` instruction 
> > either... Sorry for sending you on a wild goose chase. 
>
> -march=haswell does the trick for C (both clang and gcc) 
> the necessary bit for the machine ir optimization (this is not a llvm 
> ir optimization pass) to do this is llc options -mcpu=haswell and 
> function attribute unsafe-fp-math=true. 
>
> > 
> > -erik 
> > 
> > On Wed, Sep 21, 2016 at 9:33 PM, Yichao Yu  > wrote: 
> >> 
> >> On Wed, Sep 21, 2016 at 9:29 PM, Erik Schnetter  > 
> >> wrote: 
> >> > On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas  > 
> >> > wrote: 
> >> >> 
> >> >> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg 
> and 
> >> >> now 
> >> >> I get results where g and h apply muladd/fma in the native code, but 
> a 
> >> >> new 
> >> >> function k which is `@fastmath` inside of f does not apply 
> muladd/fma. 
> >> >> 
> >> >> 
> >> >> 
> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910 
> >> >> 
> >> >> Should I open an issue? 
> >> > 
> >> > 
> >> > In your case, LLVM apparently thinks that `x + x + 3` is faster to 
> >> > calculate 
> >> > than `2x+3`. If you use a less round number than `2` multiplying `x`, 
> >> > you 
> >> > might see a different behaviour. 
> >> 
> >> I've personally never seen llvm create fma from mul and add. We might 
> >> not have the llvm passes enabled if LLVM is capable of doing this at 
> >> all. 
> >> 
> >> > 
> >> > -erik 
> >> > 
> >> > 
> >> >> Note that this is on v0.6 Windows. On Linux the sysimg isn't 
> rebuilding 
> >> >> for some reason, so I may need to just build from source. 
> >> >> 
> >> >> On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik Schnetter 
> >> >> wrote: 
> >> >>> 
> >> >>> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas <
> rack...@gmail.com> 
> >> >>> wrote: 
> >>  
> >>  Hi, 
> >>    First of all, does LLVM essentially fma or muladd expressions 
> like 
> >>  `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one 
> >>  explicitly use 
> >>  `muladd` and `fma` on these types of instructions (is there a 
> macro 
> >>  for 
> >>  making this easier)? 
> >> >>> 
> >> >>> 
> >> >>> Yes, LLVM will use fma machine instructions -- but only if they 
> lead 
> >> >>> to 
> >> >>> the same round-off error as using separate multiply and add 
> >> >>> instructions. If 
> >> >>> you do not care about the details of conforming to the IEEE 
> standard, 
> >> >>> then 
> >> >>> you can use the `@fastmath` macro that enables several 
> optimizations, 
> >> >>> including this one. This is described in the manual 
> >> >>> 
> >> >>> <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
>
> >> >>> 
> >> >>> 
> >>    Secondly, I am wondering if my setup is not applying these
> >>  operations 
> >>  correctly. Here's my test code: 
> >>  
> >>  f(x) = 2.0x + 3.0 
> >>  g(x) = muladd(x,2.0, 3.0) 
> >>  h(x) = fma(x,2.0, 3.0) 
> >>  
> >>  @code_llvm f(4.0) 
> >>  @code_llvm g(4.0) 
> >>  @code_llvm h(4.0) 
> >>  
> >>  @code_native f(4.0) 
> >>  @code_native g(4.0) 
> >>  @code_native h(4.0) 
> >>  
> >>  Computer 1 
> >>  
> >>  Julia Version 0.5.0-rc4+0 
> >>  Commit 9c76c3e* (2016-09-09 01:43 UTC) 
> >>  Platform Info: 
> >>    System: Linux (x86_64-redhat-linux) 
> >>    CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz 
> >>    WORD_SIZE: 64 
> >>    BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell) 
> >>    LAPACK: libopenblasp.so.0 
> >>    LIBM: libopenlibm 
> >>    LLVM: libLLVM-3.7.1 (ORCJIT, broadwell) 
> >> >>> 
> >> >>> 
> >> >>> This looks good, the "broadwell" architecture that LLVM uses should 
> >> >>> imply 
> >> >>> the respective optimizations. Try with `@fastmath`. 
> >> >>> 
> >> >>> -erik 
> >> >>> 
> >> >>> 
> >> >>> 
> >> >>> 
> >>  
> >>  (the COPR nightly on CentOS7) with 
> >>  
> >>  [crackauc@crackauc2 ~]$ lscpu 
> >>  Architecture:  x86_64 
> >>  CPU op-mode(s):32-bit, 64-bit 
> >>  Byte Order:Little Endian 
> >>  CPU(s):16 
> >>  On-line CPU(s) list:   0-15 
> >>  Thread(s) per core:1 
> >>  Core(s) per socket:8 
> >>  Socket(s): 2 
> >>  NUMA node(s):  2 
> >>  Vendor ID: GenuineIntel 
> >>  CPU family:6 
> >>  Model: 79 
> >>  Model name:Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz 
> >>  Stepping:  1 
> >>  CPU MHz:   

Re: [julia-users] Is FMA/Muladd Working Here?

2016-09-21 Thread Yichao Yu
On Wed, Sep 21, 2016 at 9:49 PM, Erik Schnetter  wrote:
> I confirm that I can't get Julia to synthesize a `vfmadd` instruction
> either... Sorry for sending you on a wild goose chase.

-march=haswell does the trick for C (both clang and gcc).
The necessary bits for the machine-IR optimization (this is not an LLVM
IR optimization pass) are the llc option -mcpu=haswell and the
function attribute unsafe-fp-math=true.
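
Viewed from the Julia side, the first ingredient is what `@fastmath` already
provides and the second comes from the CPU the JIT targets; a minimal way to
inspect both from the REPL, in the same style as the snippets elsewhere in the
thread:

k(x) = @fastmath 2.0x + 3.0

@code_llvm k(4.0)     # the ops carry the `fast` flag: `fmul fast`, `fadd fast`
@code_native k(4.0)   # check whether a vfmadd shows up (it did not in this thread)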

>
> -erik
>
> On Wed, Sep 21, 2016 at 9:33 PM, Yichao Yu  wrote:
>>
>> On Wed, Sep 21, 2016 at 9:29 PM, Erik Schnetter 
>> wrote:
>> > On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas 
>> > wrote:
>> >>
>> >> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg and
>> >> now
>> >> I get results where g and h apply muladd/fma in the native code, but a
>> >> new
>> >> function k which is `@fastmath` inside of f does not apply muladd/fma.
>> >>
>> >>
>> >> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
>> >>
>> >> Should I open an issue?
>> >
>> >
>> > In your case, LLVM apparently thinks that `x + x + 3` is faster to
>> > calculate
>> > than `2x+3`. If you use a less round number than `2` multiplying `x`,
>> > you
>> > might see a different behaviour.
>>
>> I've personally never seen llvm create fma from mul and add. We might
>> not have the llvm passes enabled if LLVM is capable of doing this at
>> all.
>>
>> >
>> > -erik
>> >
>> >
>> >> Note that this is on v0.6 Windows. On Linux the sysimg isn't rebuilding
>> >> for some reason, so I may need to just build from source.
>> >>
>> >> On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik Schnetter
>> >> wrote:
>> >>>
>> >>> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas 
>> >>> wrote:
>> 
>>  Hi,
>>    First of all, does LLVM essentially fma or muladd expressions like
>>  `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one
>>  explicitly use
>>  `muladd` and `fma` on these types of instructions (is there a macro
>>  for
>>  making this easier)?
>> >>>
>> >>>
>> >>> Yes, LLVM will use fma machine instructions -- but only if they lead
>> >>> to
>> >>> the same round-off error as using separate multiply and add
>> >>> instructions. If
>> >>> you do not care about the details of conforming to the IEEE standard,
>> >>> then
>> >>> you can use the `@fastmath` macro that enables several optimizations,
>> >>> including this one. This is described in the manual
>> >>>
>> >>> <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
>> >>>
>> >>>
>>    Secondly, I am wondering if my setup is not applying these
>>  operations
>>  correctly. Here's my test code:
>> 
>>  f(x) = 2.0x + 3.0
>>  g(x) = muladd(x,2.0, 3.0)
>>  h(x) = fma(x,2.0, 3.0)
>> 
>>  @code_llvm f(4.0)
>>  @code_llvm g(4.0)
>>  @code_llvm h(4.0)
>> 
>>  @code_native f(4.0)
>>  @code_native g(4.0)
>>  @code_native h(4.0)
>> 
>>  Computer 1
>> 
>>  Julia Version 0.5.0-rc4+0
>>  Commit 9c76c3e* (2016-09-09 01:43 UTC)
>>  Platform Info:
>>    System: Linux (x86_64-redhat-linux)
>>    CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>>    WORD_SIZE: 64
>>    BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>>    LAPACK: libopenblasp.so.0
>>    LIBM: libopenlibm
>>    LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>> >>>
>> >>>
>> >>> This looks good, the "broadwell" architecture that LLVM uses should
>> >>> imply
>> >>> the respective optimizations. Try with `@fastmath`.
>> >>>
>> >>> -erik
>> >>>
>> >>>
>> >>>
>> >>>
>> 
>>  (the COPR nightly on CentOS7) with
>> 
>>  [crackauc@crackauc2 ~]$ lscpu
>>  Architecture:  x86_64
>>  CPU op-mode(s):32-bit, 64-bit
>>  Byte Order:Little Endian
>>  CPU(s):16
>>  On-line CPU(s) list:   0-15
>>  Thread(s) per core:1
>>  Core(s) per socket:8
>>  Socket(s): 2
>>  NUMA node(s):  2
>>  Vendor ID: GenuineIntel
>>  CPU family:6
>>  Model: 79
>>  Model name:Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>>  Stepping:  1
>>  CPU MHz:   1200.000
>>  BogoMIPS:  6392.58
>>  Virtualization:VT-x
>>  L1d cache: 32K
>>  L1i cache: 32K
>>  L2 cache:  256K
>>  L3 cache:  25600K
>>  NUMA node0 CPU(s): 0-7
>>  NUMA node1 CPU(s): 8-15
>> 
>> 
>> 
>>  I get the output
>> 
>>  define double @julia_f_72025(double) #0 {
>>  top:
>>    %1 = fmul double %0, 2.00e+00
>>    %2 = fadd double %1, 3.00e+00
>>    ret double %2
>>  }
>> 
>>  define double @julia_g_72027(double) #0 {
>> 

Re: [julia-users] Is FMA/Muladd Working Here?

2016-09-21 Thread Erik Schnetter
I confirm that I can't get Julia to synthesize a `vfmadd` instruction
either... Sorry for sending you on a wild goose chase.

-erik

On Wed, Sep 21, 2016 at 9:33 PM, Yichao Yu  wrote:

> On Wed, Sep 21, 2016 at 9:29 PM, Erik Schnetter 
> wrote:
> > On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas 
> > wrote:
> >>
> >> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg and
> now
> >> I get results where g and h apply muladd/fma in the native code, but a
> new
> >> function k which is `@fastmath` inside of f does not apply muladd/fma.
> >>
> >> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
> >>
> >> Should I open an issue?
> >
> >
> > In your case, LLVM apparently thinks that `x + x + 3` is faster to
> calculate
> > than `2x+3`. If you use a less round number than `2` multiplying `x`, you
> > might see a different behaviour.
>
> I've personally never seen llvm create fma from mul and add. We might
> not have the llvm passes enabled if LLVM is capable of doing this at
> all.
>
> >
> > -erik
> >
> >
> >> Note that this is on v0.6 Windows. On Linux the sysimg isn't rebuilding
> >> for some reason, so I may need to just build from source.
> >>
> >> On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik Schnetter
> >> wrote:
> >>>
> >>> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas 
> >>> wrote:
> 
>  Hi,
>    First of all, does LLVM essentially fma or muladd expressions like
>  `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one
> explicitly use
>  `muladd` and `fma` on these types of instructions (is there a macro
> for
>  making this easier)?
> >>>
> >>>
> >>> Yes, LLVM will use fma machine instructions -- but only if they lead to
> >>> the same round-off error as using separate multiply and add
> instructions. If
> >>> you do not care about the details of conforming to the IEEE standard,
> then
> >>> you can use the `@fastmath` macro that enables several optimizations,
> >>> including this one. This is described in the manual
> >>> <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
> >>>
> >>>
>    Secondly, I am wondering if my setup is not applying these operations
>  correctly. Here's my test code:
> 
>  f(x) = 2.0x + 3.0
>  g(x) = muladd(x,2.0, 3.0)
>  h(x) = fma(x,2.0, 3.0)
> 
>  @code_llvm f(4.0)
>  @code_llvm g(4.0)
>  @code_llvm h(4.0)
> 
>  @code_native f(4.0)
>  @code_native g(4.0)
>  @code_native h(4.0)
> 
>  Computer 1
> 
>  Julia Version 0.5.0-rc4+0
>  Commit 9c76c3e* (2016-09-09 01:43 UTC)
>  Platform Info:
>    System: Linux (x86_64-redhat-linux)
>    CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>    WORD_SIZE: 64
>    BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>    LAPACK: libopenblasp.so.0
>    LIBM: libopenlibm
>    LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
> >>>
> >>>
> >>> This looks good, the "broadwell" architecture that LLVM uses should
> imply
> >>> the respective optimizations. Try with `@fastmath`.
> >>>
> >>> -erik
> >>>
> >>>
> >>>
> >>>
> 
>  (the COPR nightly on CentOS7) with
> 
>  [crackauc@crackauc2 ~]$ lscpu
>  Architecture:  x86_64
>  CPU op-mode(s):32-bit, 64-bit
>  Byte Order:Little Endian
>  CPU(s):16
>  On-line CPU(s) list:   0-15
>  Thread(s) per core:1
>  Core(s) per socket:8
>  Socket(s): 2
>  NUMA node(s):  2
>  Vendor ID: GenuineIntel
>  CPU family:6
>  Model: 79
>  Model name:Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>  Stepping:  1
>  CPU MHz:   1200.000
>  BogoMIPS:  6392.58
>  Virtualization:VT-x
>  L1d cache: 32K
>  L1i cache: 32K
>  L2 cache:  256K
>  L3 cache:  25600K
>  NUMA node0 CPU(s): 0-7
>  NUMA node1 CPU(s): 8-15
> 
> 
> 
>  I get the output
> 
>  define double @julia_f_72025(double) #0 {
>  top:
>    %1 = fmul double %0, 2.00e+00
>    %2 = fadd double %1, 3.00e+00
>    ret double %2
>  }
> 
>  define double @julia_g_72027(double) #0 {
>  top:
>    %1 = call double @llvm.fmuladd.f64(double %0, double 2.00e+00,
>  double 3.00e+00)
>    ret double %1
>  }
> 
>  define double @julia_h_72029(double) #0 {
>  top:
>    %1 = call double @llvm.fma.f64(double %0, double 2.00e+00,
> double
>  3.00e+00)
>    ret double %1
>  }
>  .text
>  Filename: fmatest.jl
>  pushq %rbp
>  movq %rsp, %rbp
>  Source line: 1
>  addsd %xmm0, %xmm0
>  movabsq $139916162906520, %rax  # 

Re: [julia-users] Is FMA/Muladd Working Here?

2016-09-21 Thread Yichao Yu
On Wed, Sep 21, 2016 at 9:33 PM, Yichao Yu  wrote:
> On Wed, Sep 21, 2016 at 9:29 PM, Erik Schnetter  wrote:
>> On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas 
>> wrote:
>>>
>>> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg and now
>>> I get results where g and h apply muladd/fma in the native code, but a new
>>> function k which is `@fastmath` inside of f does not apply muladd/fma.
>>>
>>> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
>>>
>>> Should I open an issue?
>>
>>
>> In your case, LLVM apparently thinks that `x + x + 3` is faster to calculate
>> than `2x+3`. If you use a less round number than `2` multiplying `x`, you
>> might see a different behaviour.
>
> I've personally never seen llvm create fma from mul and add. We might
> not have the llvm passes enabled if LLVM is capable of doing this at
> all.

Interestingly, both clang and gcc keep the mul and add with `-Ofast
-ffast-math -mavx2` and make it an fma with `-mavx512f`. This is true
even when the call is in a loop (since switching between SSE and AVX
is costly), so I'd say either the compiler is right that the fma
instruction gives no speed advantage in this case, or it's an LLVM/GCC
missed optimization rather than a Julia one.

>
>>
>> -erik
>>
>>
>>> Note that this is on v0.6 Windows. On Linux the sysimg isn't rebuilding
>>> for some reason, so I may need to just build from source.
>>>
>>> On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik Schnetter
>>> wrote:

 On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas 
 wrote:
>
> Hi,
>   First of all, does LLVM essentially fma or muladd expressions like
> `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one explicitly use
> `muladd` and `fma` on these types of instructions (is there a macro for
> making this easier)?


 Yes, LLVM will use fma machine instructions -- but only if they lead to
 the same round-off error as using separate multiply and add instructions. 
 If
 you do not care about the details of conforming to the IEEE standard, then
 you can use the `@fastmath` macro that enables several optimizations,
 including this one. This is described in the manual
 <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.


>   Secondly, I am wondering if my setup is not applying these operations
> correctly. Here's my test code:
>
> f(x) = 2.0x + 3.0
> g(x) = muladd(x,2.0, 3.0)
> h(x) = fma(x,2.0, 3.0)
>
> @code_llvm f(4.0)
> @code_llvm g(4.0)
> @code_llvm h(4.0)
>
> @code_native f(4.0)
> @code_native g(4.0)
> @code_native h(4.0)
>
> Computer 1
>
> Julia Version 0.5.0-rc4+0
> Commit 9c76c3e* (2016-09-09 01:43 UTC)
> Platform Info:
>   System: Linux (x86_64-redhat-linux)
>   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>   WORD_SIZE: 64
>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>   LAPACK: libopenblasp.so.0
>   LIBM: libopenlibm
>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)


 This looks good, the "broadwell" architecture that LLVM uses should imply
 the respective optimizations. Try with `@fastmath`.

 -erik




>
> (the COPR nightly on CentOS7) with
>
> [crackauc@crackauc2 ~]$ lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):16
> On-line CPU(s) list:   0-15
> Thread(s) per core:1
> Core(s) per socket:8
> Socket(s): 2
> NUMA node(s):  2
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 79
> Model name:Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
> Stepping:  1
> CPU MHz:   1200.000
> BogoMIPS:  6392.58
> Virtualization:VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache:  256K
> L3 cache:  25600K
> NUMA node0 CPU(s): 0-7
> NUMA node1 CPU(s): 8-15
>
>
>
> I get the output
>
> define double @julia_f_72025(double) #0 {
> top:
>   %1 = fmul double %0, 2.00e+00
>   %2 = fadd double %1, 3.00e+00
>   ret double %2
> }
>
> define double @julia_g_72027(double) #0 {
> top:
>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.00e+00,
> double 3.00e+00)
>   ret double %1
> }
>
> define double @julia_h_72029(double) #0 {
> top:
>   %1 = call double @llvm.fma.f64(double %0, double 2.00e+00, double
> 3.00e+00)
>   ret double %1
> }
> .text
> Filename: fmatest.jl
> pushq %rbp
> movq 

Re: [julia-users] Is FMA/Muladd Working Here?

2016-09-21 Thread Yichao Yu
On Wed, Sep 21, 2016 at 9:29 PM, Erik Schnetter  wrote:
> On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas 
> wrote:
>>
>> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg and now
>> I get results where g and h apply muladd/fma in the native code, but a new
>> function k which is `@fastmath` inside of f does not apply muladd/fma.
>>
>> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
>>
>> Should I open an issue?
>
>
> In your case, LLVM apparently thinks that `x + x + 3` is faster to calculate
> than `2x+3`. If you use a less round number than `2` multiplying `x`, you
> might see a different behaviour.

I've personally never seen LLVM create an fma from a mul and an add. We might
not have the relevant LLVM passes enabled, if LLVM is capable of doing this at
all.

>
> -erik
>
>
>> Note that this is on v0.6 Windows. On Linux the sysimg isn't rebuilding
>> for some reason, so I may need to just build from source.
>>
>> On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik Schnetter
>> wrote:
>>>
>>> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas 
>>> wrote:

 Hi,
   First of all, does LLVM essentially fma or muladd expressions like
 `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one explicitly use
 `muladd` and `fma` on these types of instructions (is there a macro for
 making this easier)?
>>>
>>>
>>> Yes, LLVM will use fma machine instructions -- but only if they lead to
>>> the same round-off error as using separate multiply and add instructions. If
>>> you do not care about the details of conforming to the IEEE standard, then
>>> you can use the `@fastmath` macro that enables several optimizations,
>>> including this one. This is described in the manual
>>> <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
>>>
>>>
   Secondly, I am wondering if my setup is not applying these operations
 correctly. Here's my test code:

 f(x) = 2.0x + 3.0
 g(x) = muladd(x,2.0, 3.0)
 h(x) = fma(x,2.0, 3.0)

 @code_llvm f(4.0)
 @code_llvm g(4.0)
 @code_llvm h(4.0)

 @code_native f(4.0)
 @code_native g(4.0)
 @code_native h(4.0)

 Computer 1

 Julia Version 0.5.0-rc4+0
 Commit 9c76c3e* (2016-09-09 01:43 UTC)
 Platform Info:
   System: Linux (x86_64-redhat-linux)
   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
   WORD_SIZE: 64
   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
   LAPACK: libopenblasp.so.0
   LIBM: libopenlibm
   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>>>
>>>
>>> This looks good, the "broadwell" architecture that LLVM uses should imply
>>> the respective optimizations. Try with `@fastmath`.
>>>
>>> -erik
>>>
>>>
>>>
>>>

 (the COPR nightly on CentOS7) with

 [crackauc@crackauc2 ~]$ lscpu
 Architecture:  x86_64
 CPU op-mode(s):32-bit, 64-bit
 Byte Order:Little Endian
 CPU(s):16
 On-line CPU(s) list:   0-15
 Thread(s) per core:1
 Core(s) per socket:8
 Socket(s): 2
 NUMA node(s):  2
 Vendor ID: GenuineIntel
 CPU family:6
 Model: 79
 Model name:Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
 Stepping:  1
 CPU MHz:   1200.000
 BogoMIPS:  6392.58
 Virtualization:VT-x
 L1d cache: 32K
 L1i cache: 32K
 L2 cache:  256K
 L3 cache:  25600K
 NUMA node0 CPU(s): 0-7
 NUMA node1 CPU(s): 8-15



 I get the output

 define double @julia_f_72025(double) #0 {
 top:
   %1 = fmul double %0, 2.00e+00
   %2 = fadd double %1, 3.00e+00
   ret double %2
 }

 define double @julia_g_72027(double) #0 {
 top:
   %1 = call double @llvm.fmuladd.f64(double %0, double 2.00e+00,
 double 3.00e+00)
   ret double %1
 }

 define double @julia_h_72029(double) #0 {
 top:
   %1 = call double @llvm.fma.f64(double %0, double 2.00e+00, double
 3.00e+00)
   ret double %1
 }
 .text
 Filename: fmatest.jl
 pushq %rbp
 movq %rsp, %rbp
 Source line: 1
 addsd %xmm0, %xmm0
 movabsq $139916162906520, %rax  # imm = 0x7F40C5303998
 addsd (%rax), %xmm0
 popq %rbp
 retq
 nopl (%rax,%rax)
 .text
 Filename: fmatest.jl
 pushq %rbp
 movq %rsp, %rbp
 Source line: 2
 addsd %xmm0, %xmm0
 movabsq $139916162906648, %rax  # imm = 0x7F40C5303A18
 addsd (%rax), %xmm0
 popq %rbp
 retq
 nopl (%rax,%rax)
 .text
 Filename: fmatest.jl
 pushq %rbp
 movq %rsp, %rbp
 movabsq $139916162906776, %rax  # imm = 0x7F40C5303A98
 Source line: 3
 movsd (%rax), 

Re: [julia-users] Is FMA/Muladd Working Here?

2016-09-21 Thread Chris Rackauckas
Still no FMA?

julia> k(x) = @fastmath 2.4x + 3.0
WARNING: Method definition k(Any) in module Main at REPL[14]:1 overwritten 
at REPL[23]:1.
k (generic function with 1 method)

julia> @code_llvm k(4.0)

; Function Attrs: uwtable
define double @julia_k_66737(double) #0 {
top:
  %1 = fmul fast double %0, 2.40e+00
  %2 = fadd fast double %1, 3.00e+00
  ret double %2
}

julia> @code_native k(4.0)
.text
Filename: REPL[23]
pushq   %rbp
movq%rsp, %rbp
movabsq $568231032, %rax# imm = 0x21DE8478
Source line: 1
vmulsd  (%rax), %xmm0, %xmm0
movabsq $568231040, %rax# imm = 0x21DE8480
vaddsd  (%rax), %xmm0, %xmm0
popq%rbp
retq
nopw%cs:(%rax,%rax)
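
A cross-check that does not require reading assembly is to test numerically
whether a fused operation actually ran; a small sketch, with the constant
chosen so that one rounding and two roundings give different answers:

a = 1.0 + 2.0^-27
fma(a, a, -1.0)      # one rounding:  2^-26 + 2^-54
a*a - 1.0            # two roundings: 2^-26
muladd(a, a, -1.0)   # matches the fma result only if a fused instruction was used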



On Wednesday, September 21, 2016 at 6:29:44 PM UTC-7, Erik Schnetter wrote:
>
> On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas  > wrote:
>
>> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg and now 
>> I get results where g and h apply muladd/fma in the native code, but a new 
>> function k which is `@fastmath` inside of f does not apply muladd/fma.
>>
>> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
>>
>> Should I open an issue?
>>
>
> In your case, LLVM apparently thinks that `x + x + 3` is faster to 
> calculate than `2x+3`. If you use a less round number than `2` multiplying 
> `x`, you might see a different behaviour.
>
> -erik
>
>
> Note that this is on v0.6 Windows. On Linux the sysimg isn't rebuilding 
>> for some reason, so I may need to just build from source.
>>
>> On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik Schnetter 
>> wrote:
>>>
>>> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas  
>>> wrote:
>>>
 Hi,
   First of all, does LLVM essentially fma or muladd expressions like 
 `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one explicitly use 
 `muladd` and `fma` on these types of instructions (is there a macro for 
 making this easier)?

>>>
>>> Yes, LLVM will use fma machine instructions -- but only if they lead to 
>>> the same round-off error as using separate multiply and add instructions. 
>>> If you do not care about the details of conforming to the IEEE standard, 
>>> then you can use the `@fastmath` macro that enables several optimizations, 
>>> including this one. This is described in the manual
>>> <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
>>>
>>>
>>>   Secondly, I am wondering if my setup is not applying these operations
 correctly. Here's my test code:

 f(x) = 2.0x + 3.0
 g(x) = muladd(x,2.0, 3.0)
 h(x) = fma(x,2.0, 3.0)

 @code_llvm f(4.0)
 @code_llvm g(4.0)
 @code_llvm h(4.0)

 @code_native f(4.0)
 @code_native g(4.0)
 @code_native h(4.0)

 *Computer 1*

 Julia Version 0.5.0-rc4+0
 Commit 9c76c3e* (2016-09-09 01:43 UTC)
 Platform Info:
   System: Linux (x86_64-redhat-linux)
   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
   WORD_SIZE: 64
   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
   LAPACK: libopenblasp.so.0
   LIBM: libopenlibm
   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)

>>>
>>> This looks good, the "broadwell" architecture that LLVM uses should 
>>> imply the respective optimizations. Try with `@fastmath`.
>>>
>>> -erik
>>>
>>>
>>>
>>>  
>>>
 (the COPR nightly on CentOS7) with 

 [crackauc@crackauc2 ~]$ lscpu
 Architecture:  x86_64
 CPU op-mode(s):32-bit, 64-bit
 Byte Order:Little Endian
 CPU(s):16
 On-line CPU(s) list:   0-15
 Thread(s) per core:1
 Core(s) per socket:8
 Socket(s): 2
 NUMA node(s):  2
 Vendor ID: GenuineIntel
 CPU family:6
 Model: 79
 Model name:Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
 Stepping:  1
 CPU MHz:   1200.000
 BogoMIPS:  6392.58
 Virtualization:VT-x
 L1d cache: 32K
 L1i cache: 32K
 L2 cache:  256K
 L3 cache:  25600K
 NUMA node0 CPU(s): 0-7
 NUMA node1 CPU(s): 8-15



 I get the output

 define double @julia_f_72025(double) #0 {
 top:
   %1 = fmul double %0, 2.00e+00
   %2 = fadd double %1, 3.00e+00
   ret double %2
 }

 define double @julia_g_72027(double) #0 {
 top:
   %1 = call double @llvm.fmuladd.f64(double %0, double 2.00e+00, 
 double 3.00e+00)
   ret double %1
 }

 define double @julia_h_72029(double) #0 {
 top:
   %1 = call double @llvm.fma.f64(double %0, double 2.00e+00, double 
 3.00e+00)
   ret double %1
 }
 .text
 Filename: 

Re: [julia-users] Is FMA/Muladd Working Here?

2016-09-21 Thread Erik Schnetter
On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas 
wrote:

> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg and now
> I get results where g and h apply muladd/fma in the native code, but a new
> function k which is `@fastmath` inside of f does not apply muladd/fma.
>
> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
>
> Should I open an issue?
>

In your case, LLVM apparently thinks that `x + x + 3` is faster to
calculate than `2x+3`. If you use a less round number than `2` multiplying
`x`, you might see a different behaviour.

-erik


Note that this is on v0.6 Windows. On Linux the sysimg isn't rebuilding for
> some reason, so I may need to just build from source.
>
> On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik Schnetter wrote:
>>
>> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas 
>> wrote:
>>
>>> Hi,
>>>   First of all, does LLVM essentially fma or muladd expressions like
>>> `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one explicitly use
>>> `muladd` and `fma` on these types of instructions (is there a macro for
>>> making this easier)?
>>>
>>
>> Yes, LLVM will use fma machine instructions -- but only if they lead to
>> the same round-off error as using separate multiply and add instructions.
>> If you do not care about the details of conforming to the IEEE standard,
>> then you can use the `@fastmath` macro that enables several optimizations,
>> including this one. This is described in the manual
>> <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
>>
>>
>>   Secondly, I am wondering if my setup is not applying these operations
>>> correctly. Here's my test code:
>>>
>>> f(x) = 2.0x + 3.0
>>> g(x) = muladd(x,2.0, 3.0)
>>> h(x) = fma(x,2.0, 3.0)
>>>
>>> @code_llvm f(4.0)
>>> @code_llvm g(4.0)
>>> @code_llvm h(4.0)
>>>
>>> @code_native f(4.0)
>>> @code_native g(4.0)
>>> @code_native h(4.0)
>>>
>>> *Computer 1*
>>>
>>> Julia Version 0.5.0-rc4+0
>>> Commit 9c76c3e* (2016-09-09 01:43 UTC)
>>> Platform Info:
>>>   System: Linux (x86_64-redhat-linux)
>>>   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>>>   WORD_SIZE: 64
>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>   LAPACK: libopenblasp.so.0
>>>   LIBM: libopenlibm
>>>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>>>
>>
>> This looks good, the "broadwell" architecture that LLVM uses should imply
>> the respective optimizations. Try with `@fastmath`.
>>
>> -erik
>>
>>
>>
>>
>>
>>> (the COPR nightly on CentOS7) with
>>>
>>> [crackauc@crackauc2 ~]$ lscpu
>>> Architecture:  x86_64
>>> CPU op-mode(s):32-bit, 64-bit
>>> Byte Order:Little Endian
>>> CPU(s):16
>>> On-line CPU(s) list:   0-15
>>> Thread(s) per core:1
>>> Core(s) per socket:8
>>> Socket(s): 2
>>> NUMA node(s):  2
>>> Vendor ID: GenuineIntel
>>> CPU family:6
>>> Model: 79
>>> Model name:Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>>> Stepping:  1
>>> CPU MHz:   1200.000
>>> BogoMIPS:  6392.58
>>> Virtualization:VT-x
>>> L1d cache: 32K
>>> L1i cache: 32K
>>> L2 cache:  256K
>>> L3 cache:  25600K
>>> NUMA node0 CPU(s): 0-7
>>> NUMA node1 CPU(s): 8-15
>>>
>>>
>>>
>>> I get the output
>>>
>>> define double @julia_f_72025(double) #0 {
>>> top:
>>>   %1 = fmul double %0, 2.00e+00
>>>   %2 = fadd double %1, 3.00e+00
>>>   ret double %2
>>> }
>>>
>>> define double @julia_g_72027(double) #0 {
>>> top:
>>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.00e+00,
>>> double 3.00e+00)
>>>   ret double %1
>>> }
>>>
>>> define double @julia_h_72029(double) #0 {
>>> top:
>>>   %1 = call double @llvm.fma.f64(double %0, double 2.00e+00, double
>>> 3.00e+00)
>>>   ret double %1
>>> }
>>> .text
>>> Filename: fmatest.jl
>>> pushq %rbp
>>> movq %rsp, %rbp
>>> Source line: 1
>>> addsd %xmm0, %xmm0
>>> movabsq $139916162906520, %rax  # imm = 0x7F40C5303998
>>> addsd (%rax), %xmm0
>>> popq %rbp
>>> retq
>>> nopl (%rax,%rax)
>>> .text
>>> Filename: fmatest.jl
>>> pushq %rbp
>>> movq %rsp, %rbp
>>> Source line: 2
>>> addsd %xmm0, %xmm0
>>> movabsq $139916162906648, %rax  # imm = 0x7F40C5303A18
>>> addsd (%rax), %xmm0
>>> popq %rbp
>>> retq
>>> nopl (%rax,%rax)
>>> .text
>>> Filename: fmatest.jl
>>> pushq %rbp
>>> movq %rsp, %rbp
>>> movabsq $139916162906776, %rax  # imm = 0x7F40C5303A98
>>> Source line: 3
>>> movsd (%rax), %xmm1   # xmm1 = mem[0],zero
>>> movabsq $139916162906784, %rax  # imm = 0x7F40C5303AA0
>>> movsd (%rax), %xmm2   # xmm2 = mem[0],zero
>>> movabsq $139925776008800, %rax  # imm = 0x7F43022C8660
>>> popq %rbp
>>> jmpq *%rax
>>> nopl (%rax)
>>>
>>> It looks like explicit muladd or not ends up at the same native code,
>>> but is that native code actually doing an fma? The 

Re: [julia-users] Is FMA/Muladd Working Here?

2016-09-21 Thread Erik Schnetter
On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas 
wrote:

> Hi,
>   First of all, does LLVM essentially fma or muladd expressions like
> `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one explicitly use
> `muladd` and `fma` on these types of instructions (is there a macro for
> making this easier)?
>

Yes, LLVM will use fma machine instructions -- but only if they lead to the
same round-off error as using separate multiply and add instructions. If
you do not care about the details of conforming to the IEEE standard, then
you can use the `@fastmath` macro that enables several optimizations,
including this one. This is described in the manual
<http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
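
Concretely, the three spellings and their contracts, using the same toy
functions as in the quoted code (a short sketch):

h(x) = fma(x, 2.0, 3.0)      # exactly one rounding; may call a slow software
                             # routine if the CPU has no FMA instruction
g(x) = muladd(x, 2.0, 3.0)   # fuse only if it is free: the compiler may emit
                             # either an fma or a separate mul and add
k(x) = @fastmath 2.0x + 3.0  # relaxes IEEE conformance for this expression,
                             # which permits contraction among other rewrites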


  Secondly, I am wondering if my setup is not applying these operations
> correctly. Here's my test code:
>
> f(x) = 2.0x + 3.0
> g(x) = muladd(x,2.0, 3.0)
> h(x) = fma(x,2.0, 3.0)
>
> @code_llvm f(4.0)
> @code_llvm g(4.0)
> @code_llvm h(4.0)
>
> @code_native f(4.0)
> @code_native g(4.0)
> @code_native h(4.0)
>
> *Computer 1*
>
> Julia Version 0.5.0-rc4+0
> Commit 9c76c3e* (2016-09-09 01:43 UTC)
> Platform Info:
>   System: Linux (x86_64-redhat-linux)
>   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>   WORD_SIZE: 64
>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>   LAPACK: libopenblasp.so.0
>   LIBM: libopenlibm
>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>

This looks good, the "broadwell" architecture that LLVM uses should imply
the respective optimizations. Try with `@fastmath`.

-erik





> (the COPR nightly on CentOS7) with
>
> [crackauc@crackauc2 ~]$ lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):16
> On-line CPU(s) list:   0-15
> Thread(s) per core:1
> Core(s) per socket:8
> Socket(s): 2
> NUMA node(s):  2
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 79
> Model name:Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
> Stepping:  1
> CPU MHz:   1200.000
> BogoMIPS:  6392.58
> Virtualization:VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache:  256K
> L3 cache:  25600K
> NUMA node0 CPU(s): 0-7
> NUMA node1 CPU(s): 8-15
>
>
>
> I get the output
>
> define double @julia_f_72025(double) #0 {
> top:
>   %1 = fmul double %0, 2.00e+00
>   %2 = fadd double %1, 3.00e+00
>   ret double %2
> }
>
> define double @julia_g_72027(double) #0 {
> top:
>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.00e+00,
> double 3.00e+00)
>   ret double %1
> }
>
> define double @julia_h_72029(double) #0 {
> top:
>   %1 = call double @llvm.fma.f64(double %0, double 2.00e+00, double
> 3.00e+00)
>   ret double %1
> }
> .text
> Filename: fmatest.jl
> pushq %rbp
> movq %rsp, %rbp
> Source line: 1
> addsd %xmm0, %xmm0
> movabsq $139916162906520, %rax  # imm = 0x7F40C5303998
> addsd (%rax), %xmm0
> popq %rbp
> retq
> nopl (%rax,%rax)
> .text
> Filename: fmatest.jl
> pushq %rbp
> movq %rsp, %rbp
> Source line: 2
> addsd %xmm0, %xmm0
> movabsq $139916162906648, %rax  # imm = 0x7F40C5303A18
> addsd (%rax), %xmm0
> popq %rbp
> retq
> nopl (%rax,%rax)
> .text
> Filename: fmatest.jl
> pushq %rbp
> movq %rsp, %rbp
> movabsq $139916162906776, %rax  # imm = 0x7F40C5303A98
> Source line: 3
> movsd (%rax), %xmm1   # xmm1 = mem[0],zero
> movabsq $139916162906784, %rax  # imm = 0x7F40C5303AA0
> movsd (%rax), %xmm2   # xmm2 = mem[0],zero
> movabsq $139925776008800, %rax  # imm = 0x7F43022C8660
> popq %rbp
> jmpq *%rax
> nopl (%rax)
>
> It looks like explicit muladd or not ends up at the same native code, but
> is that native code actually doing an fma? The fma native is different, but
> from a discussion on the Gitter it seems that might be a software FMA? This
> computer is setup with the BIOS setting as LAPACK optimized or something
> like that, so is that messing with something?
>
> *Computer 2*
>
> Julia Version 0.6.0-dev.557
> Commit c7a4897 (2016-09-08 17:50 UTC)
> Platform Info:
>   System: NT (x86_64-w64-mingw32)
>   CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
>   WORD_SIZE: 64
>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>   LAPACK: libopenblas64_
>   LIBM: libopenlibm
>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>
>
> on a 4770k i7, Windows 10, I get the output
>
> ; Function Attrs: uwtable
> define double @julia_f_66153(double) #0 {
> top:
>   %1 = fmul double %0, 2.00e+00
>   %2 = fadd double %1, 3.00e+00
>   ret double %2
> }
>
> ; Function Attrs: uwtable
> define double @julia_g_66157(double) #0 {
> top:
>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.00e+00,
> double 3.00e+00)
>   ret double %1
> }
>
> ; Function Attrs: uwtable
> define double @julia_h_66158(double) #0 {
> top:
>   %1 = call double 

[julia-users] Is FMA/Muladd Working Here?

2016-09-20 Thread Chris Rackauckas
Hi,
  First of all, does LLVM automatically apply fma or muladd to expressions like `a1*x1
+ a2*x2 + a3*x3 + a4*x4`? Or is it required that one explicitly use 
`muladd` and `fma` on these types of instructions (is there a macro for 
making this easier)?

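For reference, explicitly fusing an expression like that means nesting
`muladd` calls, and `@fastmath` over the whole expression is the macro-style
alternative; a minimal sketch, with purely illustrative names:

dot4(a1, x1, a2, x2, a3, x3, a4, x4) =
    muladd(a1, x1, muladd(a2, x2, muladd(a3, x3, a4*x4)))

dot4_fast(a1, x1, a2, x2, a3, x3, a4, x4) =
    @fastmath a1*x1 + a2*x2 + a3*x3 + a4*x4
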
  Secondly, I am wondering if my setup is not applying these operations
correctly. Here's my test code:

f(x) = 2.0x + 3.0
g(x) = muladd(x,2.0, 3.0)
h(x) = fma(x,2.0, 3.0)

@code_llvm f(4.0)
@code_llvm g(4.0)
@code_llvm h(4.0)

@code_native f(4.0)
@code_native g(4.0)
@code_native h(4.0)

*Computer 1*

Julia Version 0.5.0-rc4+0
Commit 9c76c3e* (2016-09-09 01:43 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)

(the COPR nightly on CentOS7) with 

[crackauc@crackauc2 ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
Stepping:              1
CPU MHz:               1200.000
BogoMIPS:              6392.58
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15



I get the output

define double @julia_f_72025(double) #0 {
top:
  %1 = fmul double %0, 2.00e+00
  %2 = fadd double %1, 3.00e+00
  ret double %2
}

define double @julia_g_72027(double) #0 {
top:
  %1 = call double @llvm.fmuladd.f64(double %0, double 2.00e+00, double 
3.00e+00)
  ret double %1
}

define double @julia_h_72029(double) #0 {
top:
  %1 = call double @llvm.fma.f64(double %0, double 2.00e+00, double 
3.00e+00)
  ret double %1
}
.text
Filename: fmatest.jl
pushq %rbp
movq %rsp, %rbp
Source line: 1
addsd %xmm0, %xmm0
movabsq $139916162906520, %rax  # imm = 0x7F40C5303998
addsd (%rax), %xmm0
popq %rbp
retq
nopl (%rax,%rax)
.text
Filename: fmatest.jl
pushq %rbp
movq %rsp, %rbp
Source line: 2
addsd %xmm0, %xmm0
movabsq $139916162906648, %rax  # imm = 0x7F40C5303A18
addsd (%rax), %xmm0
popq %rbp
retq
nopl (%rax,%rax)
.text
Filename: fmatest.jl
pushq %rbp
movq %rsp, %rbp
movabsq $139916162906776, %rax  # imm = 0x7F40C5303A98
Source line: 3
movsd (%rax), %xmm1   # xmm1 = mem[0],zero
movabsq $139916162906784, %rax  # imm = 0x7F40C5303AA0
movsd (%rax), %xmm2   # xmm2 = mem[0],zero
movabsq $139925776008800, %rax  # imm = 0x7F43022C8660
popq %rbp
jmpq *%rax
nopl (%rax)

It looks like explicit muladd or not ends up at the same native code, but
is that native code actually doing an fma? The fma native code is different, but
from a discussion on Gitter it seems that it might be a software FMA. This
computer is set up with a BIOS setting like "LAPACK optimized" or something
like that, so is that messing with something?

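One rough way to tell a library-call fma from an inlined hardware one, if the
assembly is inconclusive, is to time a dependent chain of each; a sketch,
assuming a software fallback would be dramatically slower:

function sumfma(n)
    s = 0.0
    for i in 1:n
        s = fma(s, 1.0000001, 1.0e-9)   # arbitrary constants, loop-carried
    end
    s
end

function summuladd(n)
    s = 0.0
    for i in 1:n
        s = muladd(s, 1.0000001, 1.0e-9)
    end
    s
end

sumfma(1); summuladd(1)   # compile first
@time sumfma(10^7)
@time summuladd(10^7)
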
*Computer 2*

Julia Version 0.6.0-dev.557
Commit c7a4897 (2016-09-08 17:50 UTC)
Platform Info:
  System: NT (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)


on a 4770k i7, Windows 10, I get the output

; Function Attrs: uwtable
define double @julia_f_66153(double) #0 {
top:
  %1 = fmul double %0, 2.00e+00
  %2 = fadd double %1, 3.00e+00
  ret double %2
}

; Function Attrs: uwtable
define double @julia_g_66157(double) #0 {
top:
  %1 = call double @llvm.fmuladd.f64(double %0, double 2.00e+00, double 
3.00e+00)
  ret double %1
}

; Function Attrs: uwtable
define double @julia_h_66158(double) #0 {
top:
  %1 = call double @llvm.fma.f64(double %0, double 2.00e+00, double 
3.00e+00)
  ret double %1
}
.text
Filename: console
pushq %rbp
movq %rsp, %rbp
Source line: 1
addsd %xmm0, %xmm0
movabsq $534749456, %rax# imm = 0x1FDFA110
addsd (%rax), %xmm0
popq %rbp
retq
nopl (%rax,%rax)
.text
Filename: console
pushq %rbp
movq %rsp, %rbp
Source line: 2
addsd %xmm0, %xmm0
movabsq $534749584, %rax# imm = 0x1FDFA190
addsd (%rax), %xmm0
popq %rbp
retq
nopl (%rax,%rax)
.text
Filename: console
pushq %rbp
movq %rsp, %rbp
movabsq $534749712, %rax# imm = 0x1FDFA210
Source line: 3
movsd dcabs164_(%rax), %xmm1  # xmm1 = mem[0],zero
movabsq $534749720, %rax# imm = 0x1FDFA218
movsd (%rax), %xmm2   # xmm2 = mem[0],zero
movabsq $fma, %rax
popq %rbp
jmpq *%rax
nop

This seems to be similar to the first result.