[julia-users] Re: Same native code, different performance

2015-09-25 Thread TY
Out of curiosity, I tried the following (on Julia 0.4-rc1):

  f(x) = ( c = cos(x); c^3 )
  f_float(x) = ( c = cos(x); c^3.0 )

then I get
  0.006489 seconds
  0.013220 seconds

but with the original code, I get
  0.076714 seconds
  0.013280 seconds

(both without @fastmath)
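
For reference, here is a minimal sketch of the timing loop assumed for the
numbers above (modeled on the bench function that appears later in the
thread; accumulating into s keeps the calls from being optimized away):

function bench(N)
    s = 0.0
    @time for i = 1:N
        s += f(π/4)        # integer exponent
    end
    @time for i = 1:N
        s += f_float(π/4)  # float exponent
    end
    return s
end

bench(10^6);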


Re: [julia-users] Re: Same native code, different performance

2015-09-25 Thread Kristoffer Carlsson
If you want to reproduce this you can use JuliaBox. (I changed cos(x) to x 
because it doesn't change anything.)






Re: [julia-users] Re: Same native code, different performance

2015-09-25 Thread Kristoffer Carlsson
If you want to reproduce the results above and below you can use JuliaBox.

I think this has something to do with the constant propagation of sin and 
cos. Changing cos(x) to x reverses the results.

f(x) = @fastmath x^3
f2(x) = @fastmath x^3.0

fs(x) = @fastmath cos(x)^3
fs2(x) = @fastmath cos(x)^3.0

function bench(N)
    s = 0.0
    @time for i = 1:N
        s += f(π/4)
    end
    @time for i = 1:N
        s += f2(π/4)
    end
    @time for i = 1:N
        s += fs(π/4)
    end
    @time for i = 1:N
        s += fs2(π/4)
    end
    return s
end

bench(10^6);

  0.001040 seconds
  0.008002 seconds
  0.086514 seconds
  0.015082 seconds


So for f(x) = x^3 the int version is ~8 times faster.
For f(x) = cos(x)^3 the double version is ~5 times faster.

Changing the exponent from 3 to 40 gives:

  0.001048 seconds
  0.092466 seconds
  0.085958 seconds
  0.113476 seconds


where the integer versions run at pretty much the same speed, but the double 
versions get ~10 times slower.



Re: [julia-users] Re: Same native code, different performance

2015-09-25 Thread Páll Haraldsson
On Thursday, September 24, 2015 at 8:05:52 PM UTC, Jeffrey Sarnoff wrote:
>
> It could be that integer powers are done with binary shifts in software 
> and the floating point powers are computed in the fpu.
>

I suspect not. [At least not in this case, where the numbers being raised 
to a power are not integers. It would not make much sense to force the 
results of cos or sin to be integers. :)]


int^int could avoid the FPU.

float^int would, I think (at least for low integers), be slow with binary 
shifts (and more than shifts is needed), since floating-point representation 
is hard/slow to emulate in software.


What could be done (and maybe is done) is to handle float^2, float^3, and 
so on up to some small integer exponent by turning them into floating-point 
multiplications, as sketched below. In general, would LLVM take care of 
such optimizations, or would Julia have to do it/help?
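
A sketch of that idea in Julia (hypothetical helpers, not Julia's actual
lowering): unroll small literal exponents into multiplies, or use a general
power-by-squaring loop that needs only O(log n) multiplications:

pow2(x::Float64) = x * x      # float^2 as one multiply
pow3(x::Float64) = x * x * x  # float^3 as two multiplies

function powi(x::Float64, n::Int)
    n < 0 && return 1.0 / powi(x, -n)
    r = 1.0
    while n > 0
        isodd(n) && (r *= x)  # fold in the current bit of the exponent
        x *= x                # square for the next bit
        n >>= 1
    end
    return r
end

powi(cos(π/4), 3)  # same value as cos(π/4)^3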

I'm not sure how fast pow is in an FPU; it is probably not optimized for 
these simple cases, since it needs to be general, while FPUs can commonly 
issue a MUL every cycle (possibly with some latency before the result can 
be used).

Similarly, float^2.0, or any other literal "float" exponent that is 
actually an integer, could be treated by the compiler as an int.
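
As a runtime approximation of that idea (a hypothetical helper; the
compile-time treatment of literals described above would need compiler
support):

# Take the integer-power path whenever the float exponent is exact.
pow_maybe_int(x::Float64, y::Float64) = isinteger(y) ? x^Int(y) : x^y

pow_maybe_int(0.707, 3.0)  # dispatches to the Float64^Int method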

-- 
Palli.



[julia-users] Re: Same native code, different performance

2015-09-24 Thread Simon Danisch
I cannot reproduce this on RC2.
Perhaps inlining fails for f on some Julia versions?

On Thursday, September 24, 2015 at 18:04:18 UTC+2, Kristoffer Carlsson 
wrote:
>
> Can someone explain these results to me.
>
> Two functions: 
> f(x) = @fastmath cos(x)^3
> f_float(x) = @fastmath cos(x)^3.0
>
>
> Identical native code:
>
> julia> code_native(f, (Float64,))
> .text
> Filename: none
> Source line: 1
> pushq   %rbp
> movq    %rsp, %rbp
> movabsq $cos, %rax
> Source line: 1
> callq   *%rax
> movabsq $140084090479408, %rax  # imm = 0x7F67DE73A330
> vmovsd  (%rax), %xmm1
> movabsq $pow, %rax
> callq   *%rax
> popq    %rbp
> ret
>
> julia> code_native(f_float, (Float64,))
> .text
> Filename: none
> Source line: 1
> pushq   %rbp
> movq    %rsp, %rbp
> movabsq $cos, %rax
> Source line: 1
> callq   *%rax
> movabsq $140084090501536, %rax  # imm = 0x7F67DE73F9A0
> vmovsd  (%rax), %xmm1
> movabsq $pow, %rax
> callq   *%rax
> popq    %rbp
> ret
>
> Still a large difference in performance:
>
> function bench(N)
>     @time for i = 1:N
>         f(π/4)
>     end
>     @time for i = 1:N
>         f_float(π/4)
>     end
> end
>
> julia> bench(10^6)
>   0.062536 seconds
>   0.010077 seconds
>
> Secondly, can someone explain why there should be a performance difference 
> at all? Is power by a float which is == an int defined differently? IEEE 
> shenanigans?
>


Re: [julia-users] Re: Same native code, different performance

2015-09-24 Thread Mauro
I split the bench method in two, just to be sure (on 0.4-RC2).

julia> function bench(N)
           for i = 1:N
               f(π/4)
           end
       end
bench (generic function with 1 method)

julia> function bench_f(N)
           for i = 1:N
               f_float(π/4)
           end
       end
bench_f (generic function with 1 method)

They also have identical native code but run differently:

julia> @time bench_f(10^7)
  0.190613 seconds (5 allocations: 176 bytes)

julia> @time bench(10^7)
  0.780212 seconds (5 allocations: 176 bytes)

I thought that @code_native shows the code which is actually run, so why
different speeds?

If I define the f* functions without the @fastmath macro, then I get
the same performance as above:

julia> @time bench_f(10^7)
  0.203071 seconds (5 allocations: 176 bytes)

julia> @time bench(10^7)
  0.787696 seconds (5 allocations: 176 bytes)

but with different native code.

> I can reproduce... I think the 2 versions will call these methods
> respectively... I guess there's a performance difference?
>
>> pow_fast{T<:FloatTypes}(x::T, y::Integer) =
>>     box(T, Base.powi_llvm(unbox(T,x), unbox(Int32,Int32(y))))
>>
>> pow_fast(x::Float64, y::Float64) =
>>     ccall(("pow",libm), Float64, (Float64,Float64), x, y)

Tom, are those two functions the ones being called within the native code?
I'm no good at reading assembler.
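
One way to check the dispatch without reading assembler (assuming, as in
0.4's fastmath.jl, that pow_fast lives in Base.FastMath):

@which Base.FastMath.pow_fast(0.5, 3)    # should report the y::Integer method
@which Base.FastMath.pow_fast(0.5, 3.0)  # should report the y::Float64 method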


Re: [julia-users] Re: Same native code, different performance

2015-09-24 Thread Kristoffer Carlsson
I think Tom is right here. These lines call the pow function:

movabsq $pow, %rax
callq   *%rax

but the actual pow function being called is different in each case. I am 
surprised there is that much of a performance difference between the two 
pow functions... That seems odd.

What Mauro reports is also interesting: the speed difference is there (and 
is just as large) even without the @fastmath macro.

My question now is: what does IEEE say about x^double vs x^int? Is there 
any reason these should have different performance? If not, it would seem 
to make sense to always convert the exponent to a double and call the libm 
version. Doubles should be able to exactly represent all the integer 
exponents the power function takes (a quick check below).
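
A quick check of that last claim (Float64 represents every integer of
magnitude up to 2^53 exactly, so converting small exponents is lossless):

maxintfloat(Float64)           # 9.007199254740992e15, i.e. 2^53
Float64(3) == 3                # true: small exponents convert exactly
Float64(2^53) == 2^53          # true
Float64(2^53 + 1) == 2^53 + 1  # false: first integer Float64 cannot represent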




Re: [julia-users] Re: Same native code, different performance

2015-09-24 Thread Erik Schnetter
In the native code above, the C function `pow(double, double)` is called in
both cases. Maybe `llvm_powi` is involved; if so, it is lowered to the same
`pow` function. The speed difference must have a different reason.

Sometimes there are random things occurring that invalidate benchmark
results. (This could be caused by how the compiled functions are aligned
relative to cache lines or page boundaries, etc. -- this is black magic I
like to invoke if there's a result that I can't explain. You can just
ignore my ramblings here.) You could restart Julia, reboot the machine, try
a different machine, define several identical functions `f` and `f_float`
and look at their speeds, etc.

(I would have hoped that this function is translated to the equivalent of
`c=cos(x); c2=c*c; return c*c2`, but this is obviously not happening.)
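
Here is a sketch of that hand-expanded version, in case someone wants to
benchmark it against the pow-calling definitions:

f_mul(x) = ( c = cos(x); c2 = c*c; c*c2 )  # cos(x)^3 as two multiplies

function bench_mul(N)
    s = 0.0
    @time for i = 1:N
        s += f_mul(π/4)
    end
    return s
end

bench_mul(10^6);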

-erik

On Thu, Sep 24, 2015 at 4:24 PM, Kristoffer Carlsson wrote:

> But the floating ones are the faster ones. Shouldn't it be the opposite?




-- 
Erik Schnetter 
http://www.perimeterinstitute.ca/personal/eschnetter/


Re: [julia-users] Re: Same native code, different performance

2015-09-24 Thread Kristoffer Carlsson
I don't like to invoke the black-magic card here. I have tried benchmarking 
in different ways in different scenarios, and the results are consistent. 
It is also reproducible by others.

FWIW, this is what led me to this:
https://github.com/JuliaDiff/ForwardDiff.jl/issues/57

Re: [julia-users] Re: Same native code, different performance

2015-09-24 Thread Kristoffer Carlsson
But the floating ones are the faster ones. Shouldn't it be the opposite? 

Re: [julia-users] Re: Same native code, different performance

2015-09-24 Thread Yichao Yu
On Thu, Sep 24, 2015 at 4:42 PM, Erik Schnetter wrote:
> In the native code above, the C function `pow(double, double)` is called in
> both cases. Maybe `llvm_powi` is involved; if so, it is lowered to the same
> `pow` function. The speed difference must have a different reason.

Not necessarily. IIRC we use the openlibm functions by default, but
LLVM will use the system libm version.
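
A sketch for timing the two pow implementations head to head (this assumes
the shared libraries resolve as "libopenlibm" and "libm" on your system;
adjust the names if they differ):

pow_openlibm(x::Float64, y::Float64) =
    ccall((:pow, "libopenlibm"), Float64, (Float64, Float64), x, y)
pow_system(x::Float64, y::Float64) =
    ccall((:pow, "libm"), Float64, (Float64, Float64), x, y)

function bench_pows(N)
    s = 0.0
    @time for i = 1:N
        s += pow_openlibm(0.707, 3.0)  # openlibm's pow
    end
    @time for i = 1:N
        s += pow_system(0.707, 3.0)    # system libm's pow
    end
    return s
end

bench_pows(10^6);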



Re: [julia-users] Re: Same native code, different performance

2015-09-24 Thread Erik Schnetter
On Thu, Sep 24, 2015 at 4:56 PM, Yichao Yu wrote:

> On Thu, Sep 24, 2015 at 4:42 PM, Erik Schnetter wrote:
> > In the native code above, the C function `pow(double, double)` is
> > called in both cases. Maybe `llvm_powi` is involved; if so, it is
> > lowered to the same `pow` function. The speed difference must have a
> > different reason.
>
> Not necessarily, IIRC. we use the openlibm functions by default but
> llvm will use the system libm version.
>

Good catch. (I can't reproduce this locally, neither with Julia 0.4 nor
0.5, neither on OS X nor on Linux -- I'm getting different assembler code
for both functions, both different from the versions shown here, so I can't
try my suggestion below.)

To test this, you could comment out or modify the `llvm_powi` definition of
`pow`, or you could rebuild Julia without openlibm.

-erik





-- 
Erik Schnetter 
http://www.perimeterinstitute.ca/personal/eschnetter/