Re: [julia-users] Re: A Very Simple Benchmark for Brutal-force Loops in Several Languages: revised, Julia is fast!

2016-07-26 Thread dextorious
Since I contributed the Numba JIT timing earlier in the thread, it seems 
only fair to note that the modified Julia version with the properly 
preallocated data is now 17% faster than the Numba version on my computer. 
Overall, this seems to support my thesis that good Julia code is on par with 
or slightly faster than Python code using the JIT libraries (Numba, Hope, 
Parakeet, etc.), with the added benefit of not requiring external libraries. 
I've also seen cases where the difference is even more pronounced in Julia's 
favor, thanks to the ability to use @simd to prod the LLVM vectorizer along 
and to exert precise control over memory allocations, but that is highly 
use-case specific.
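
For readers who want the concrete pattern, here is a minimal sketch of what 
"properly preallocated" means in this context, written in the same 0.4/0.5-era 
style as the rest of the thread (the array size and loop counts below are 
illustrative, not the ones behind the quoted timings):

function cos_kernel!(y, x)
    # write into a preallocated output buffer; @simd nudges the LLVM
    # vectorizer, @inbounds removes the bounds check on each access
    @simd for n = 1:length(x)
        @inbounds y[n] = cos(2*x[n] + 5)
    end
    return y
end

x = collect(linspace(0, 5, 1000))  # materialize the range once as a plain Vector
y = zeros(length(x))               # allocate the output once, outside the timed loop
for m = 1:100
    cos_kernel!(y, x)              # every pass reuses the same buffers
end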

On Tuesday, July 26, 2016 at 12:47:53 AM UTC+3, Tim Holy wrote:
>
> Given the apparent interest in the topic and the decisions that people seem 
> to be making, it seems worth pointing out that folks are still using 
> apples-to-oranges comparisons on this benchmark.
>
> There are at least two important differences: 
> - In the other languages, `linspace` allocates a vector, but in Julia it 
> (currently) creates a compact object from which values are computed 
> on-the-fly. That computation involves a division, and division is slow. 
> - Languages like C aren't doing bounds-checking. You might imagine adding 
> `@inbounds` to the Julia version. But `@inbounds` has no impact on LinSpace 
> objects in Julia 0.4. In Julia 0.5, it does work like you'd expect (thanks, 
> Blake Johnson).
>
> Combining these observations, we can `collect` the values into a `Vector` 
> and then use `@inbounds`. For me the version below is nearly twice as fast 
> as the original:
>
> function benchmark()
>     nsamples = 100
>     x = collect(linspace(0, 5, nsamples))
>     y = zeros(nsamples)
>     # attempt to trigger JIT to compile all functions needed in the loops
>     # before profiling
>     a = cos(0.0); a = 1.5*2.5; a = 1.5+2.5;
>     println("\nBrutal-force loops, 100 times:")
>     @time begin
>         for m = 1:100
>             @inbounds for n = 1:nsamples
>                 y[n] = cos(2*x[n]+5);
>             end
>         end
>     end
> end
> benchmark();
>
> Best, 
> --Tim 
>
> On Monday, July 25, 2016 2:02:41 PM CDT Zhong Pan wrote: 
> > Agree that while raw speed is important, in most situations it wouldn't be 
> > the most important reason to choose one programming language over another. 
> > 
> > I came from the angle of an engineer in a small company. For myself, the 
> > main attraction of Julia was the ease of achieving decent speed without 
> > making much explicit effort: that means what feels more natural vectorized 
> > will be vectorized, while what feels more natural as a loop will be a loop; 
> > that means I don't need to resort to another language or a library just to 
> > improve speed; and that means, apart from sticking to a couple of good 
> > habits, I can use objects, functions, etc. the same way inside a loop as 
> > outside. None of these is critical by itself, but they add up to an 
> > uninterrupted flow of thought while writing code to explore, try, fail, 
> > and retry, for many iterations. 
> > 
> > During this "less careful" prototyping, a 1-2x slowdown is fine, but with 
> > Julia I know I won't sit there for tens of minutes waiting for a result 
> > while debating with myself whether I should rewrite it in C++ or overhaul 
> > the code with Cython etc.; instead I can rest assured that as long as my 
> > algorithm and coding have no mistakes or major flaws, the speed is close 
> > to what I would get even if I made several times more effort to rewrite it 
> > in C++. 
> > 
> > Another big deal for me is the resulting removal of the barrier between 
> > prototype and production code. For production I can review and improve my 
> > code carefully, but rewriting it in a less expressive language is too much. 
> > 
> > I was a huge fan of Python (heck, I even persuaded my previous boss, a VP, 
> > to pick up Python - though I don't know if he really had time to finish 
> > it. :-)). However, the slow raw speed and the excessive freedom to change 
> > a class definition anywhere always gave me the itch to find something 
> > better. My brother at JPL, who worked on Python projects, also complained 
> > about having to think really hard to vectorize almost everything and then 
> > not being able to easily understand what he was doing a few months later 
> > because the code was too unnatural for the problem; the indentation was 
> > also a big headache, as collaborators used different editors with 
> > different tab settings. 
> > 
> > So I'm really happy to have found Julia, which gave me the same joy as 
> > coding in Python and removed the main itches. 
> > 
> > -Zhong 
>
>
>

[julia-users] Re: A Very Simple Benchmark for Brutal-force Loops in Several Languages: revised, Julia is fast!

2016-07-25 Thread dextorious
I haven't done any systematic benchmarking since Numba introduced the ability 
to JIT compile entire classes. In my experience, very well written Julia code 
is usually equivalent to or better than Numba JIT'd code (better mostly in 
cases where @simd is helpful). The Python code is sometimes easier to write, 
since Numba takes care of everything, but that is a double-edged sword: if 
you run into a case where Numba doesn't work well or at all, you're just out 
of luck. In my personal view, the availability of Numba and similar libraries 
means Python vs. Julia performance comparisons aren't particularly relevant: 
you should pick the language you prefer, not the one you think is faster, and 
if absolute performance is the only metric, you have to resort to Fortran/C++ 
anyway. I'm writing most of my code in Julia because I prefer the type system 
and multiple dispatch to Python's OOP, but I don't think it's meaningfully 
faster as long as you're willing to use the appropriate libraries.

The one outlier is that GPU programming is much easier in Python at the 
moment. Hopefully that will change soon; some of the progress in ArrayFire.jl 
and the CUDA libraries is very promising.

On Monday, July 25, 2016 at 7:34:04 PM UTC+3, dnm wrote:
>
> Interesting. Did you use the updated Julia code? 
>
> Have you done any comparisons between reading and writing  Numba JIT 
> classes and Julia types in tight loops?
>
> On Monday, July 25, 2016 at 10:41:48 AM UTC-4, dexto...@gmail.com wrote:
>>
>> Just for the sake of comprehensiveness, I ran your Python benchmark 
>> through the Numba JIT library (which uses the same underlying LLVM 
>> infrastructure that Julia does) and on my computer the Python code is 
>> faster than Julia by 68%. Vanilla CPython is terrible for this kind of 
>> simple explicit loop code, but Numba and other JIT libraries largely solve 
>> that issue with minimal effort as long as the code is simple enough. That 
>> by no means solves all of Python's issues in the context of numerical 
>> programming and I'm sure the Julia benchmark could be improved as others 
>> have already mentioned, but benchmarking Python this way isn't necessarily 
>> representative of how a performance-conscious programmer would reasonably 
>> approach a problem of this kind.
>>
>

[julia-users] Re: A Very Simple Benchmark for Brutal-force Loops in Several Languages: revised, Julia is fast!

2016-07-25 Thread dextorious
Just for the sake of comprehensiveness, I ran your Python benchmark through 
the Numba JIT library (which uses the same underlying LLVM infrastructure 
that Julia does) and on my computer the Python code is faster than Julia by 
68%. Vanilla CPython is terrible for this kind of simple explicit loop 
code, but Numba and other JIT libraries largely solve that issue with 
minimal effort as long as the code is simple enough. That by no means 
solves all of Python's issues in the context of numerical programming and 
I'm sure the Julia benchmark could be improved as others have already 
mentioned, but benchmarking Python this way isn't necessarily 
representative of how a performance-conscious programmer would reasonably 
approach a problem of this kind.


[julia-users] Re: Can you make this linear algebra code faster?

2015-03-25 Thread dextorious
I hope to look at this when I get some time, but as a preliminary note, 
merely applying the @inbounds and @simd macros to the main for loop yields 
an increase in performance of about 15-20% on my machine.
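
Since the code from that thread is not reproduced in this digest, here is a 
generic illustration of the annotation pattern being described, with a 
hypothetical in-place matrix-vector kernel standing in for the actual loop:

function matvec!(y, A, x)
    m, n = size(A)
    fill!(y, 0.0)
    for j = 1:n            # outer loop over columns (column-major friendly)
        xj = x[j]
        # @simd lets LLVM vectorize the accumulation; @inbounds drops the
        # per-element bounds checks inside the hot loop
        @simd for i = 1:m
            @inbounds y[i] += A[i, j] * xj
        end
    end
    return y
end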


[julia-users] Trying to optimize numerical code - peculiar allocation data

2014-10-24 Thread dextorious
So, I've been gradually porting some of my more performance sensitive 
MATLAB research code to Julia, with great results so far. One of my current 
projects spends about 80% of the computation time calculating the 
Rotne-Prager-Yamakawa tensor at each time step, so I'm quite motivated to 
optimize that particular step of the computation as much as possible. 
Rewriting the code in Julia alone resulted in an order of magnitude 
speedup. Writing my own implementation of the dyadic product of two vectors 
gave me another factor of 2 (which I found rather surprising, but still), 
as did heavy use of @simd and @inbounds annotations where appropriate. 
Despite these successes, I still feel like there is more performance to be 
gained here. For one, the code still spends a lot of its time on garbage 
collection. Also, running Julia with --track-allocation=user gave me some 
perplexing results.

First things first, the relevant code:
function outer_product(a, b, n)
    c = zeros(n, n)
    @simd for i = 1 : n
        @simd for j = 1 : n
            @inbounds c[i, j] = a[i] * b[j]
        end
    end
    c
end

function rotne_prager(x, y, z, r)
    const p = length(x) - 1
    rpy = zeros(3*(p+1), 3*(p+1))
    rvec = zeros(3)
    @simd for i = 1 : p+1
        @simd for j = 1 : p+1
            @inbounds rvec[1] = x[j] - x[i]
            @inbounds rvec[2] = y[j] - y[i]
            @inbounds rvec[3] = z[j] - z[i]
            Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3])
            distance_ratio = r / Rij
            if (Rij > 2*r)
                @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 * distance_ratio *
                    ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * eye(3) +
                    (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product(rvec, rvec, 3)))
            else
                @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (
                    (1.0 - 9.0/32.0 / distance_ratio) * eye(3) +
                    3.0/32.0 / distance_ratio * outer_product(rvec, rvec, 3))
            end
        end
    end
    rpy
end

function rotne_bench(n)
    x = rand(n)
    y = rand(n)
    z = rand(n)
    r = 0.01
    rotne_prager(x, y, z, r);
end

rotne_bench(1000);
@time (for i=1:10; rotne_bench(1000); end)

Running this yields the following output on one of my machines:
elapsed time: 13.440131479 seconds (8718927056 bytes allocated, 36.99% gc time)

I then tried to add proper type annotations everywhere, but this did not 
change anything in the results, so I'm assuming Julia's type inference system 
is doing a great job here. Next, I tried to look at the allocations using the 
--track-allocation=user switch (by the way, this did not seem to produce a 
*.jl.mem file anywhere on my Windows machines, but worked fine on a Linux box; 
is this a known behavior?). This yielded the following output:
        - function outer_product(a, b, n)
-70967296     c = zeros(n, n)
        0     @simd for i = 1 : n
        0         @simd for j = 1 : n
        0             @inbounds c[i, j] = a[i] * b[j]
        -         end
        -     end
        0     c
        - end
        - 
        - function rotne_prager(x, y, z, r)
        0     const p = length(x) - 1
792000528     rpy = zeros(3*(p+1), 3*(p+1))
      880     rvec = zeros(3)
        0     @simd for i = 1 : p+1
        0         @simd for j = 1 : p+1
        0             @inbounds rvec[1] = x[j] - x[i]
        0             @inbounds rvec[2] = y[j] - y[i]
        0             @inbounds rvec[3] = z[j] - z[i]
        0             Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3])
        0             distance_ratio = r / Rij
        0             if (Rij > 2*r)
        0                 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 * distance_ratio *
        -                     ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * eye(3) +
        -                     (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product(rvec, rvec, 3)))
        -             else
279582208                 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (
        -                     (1.0 - 9.0/32.0 / distance_ratio) * eye(3) +
        -                     3.0/32.0 / distance_ratio * outer_product(rvec, rvec, 3))
        -             end
        -         end
        -     end
        0     rpy
        0 end
        - 
        0 function rotne_bench(n)
        0     x = rand(n)
        0     y = rand(n)
        0     z = rand(n)
        0     r = 0.01
        0     rotne_prager(x, y, z, r);
        - end
        - 
        - rotne_bench(1000);
        0 @time (for i=1:10; rotne_bench(1000); end)

I find myself somewhat confused by these results, specifically line 27. Why 
would the second expression constructing the rpy matrix cause a huge number 
of allocations when the first did not? They seem virtually identical from a 
programming point of view, with minor differences in numerical 
coefficients. I feel like, in my limited understanding of a 

Re: [julia-users] Trying to optimize numerical code - peculiar allocation data

2014-10-24 Thread dextorious
I'm absolutely sure that Rij > 2*r most of the time, yes. This is obvious 
if you consider that rand(n) returns values in the range [0, 1] and r = 
0.01. I have also verified this empirically. I'll try to explore the view 
update idea further, I've never used that approach before.


On Friday, October 24, 2014 5:46:37 PM UTC+3, Andreas Noack wrote:

 Are you sure that Rij > 2*r in your calculations? 

 You should use I instead of eye(3), but probably the best solution is to 
 avoid the explicit calculation of the outer product. You can probably write 
 your problem as a rank-one update of a view of rpy.

 Med venlig hilsen

 Andreas Noack
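
For anyone following along, here is a rough sketch of what Andreas's 
suggestion might look like in practice (a hypothetical add_block! helper, not 
code from this thread; sub() creates the view in Julia 0.3/0.4, view() in 
0.5+). The 3x3 block alpha*I + beta*rvec*rvec' is written element by element 
into a view of rpy, so no temporary matrices are built:

function add_block!(rpy, i, j, alpha, beta, rvec)
    block = sub(rpy, 3*j-2:3*j, 3*i-2:3*i)  # a view: writes go straight into rpy
    for b = 1:3, a = 1:3
        @inbounds block[a, b] = beta * rvec[a] * rvec[b]  # rank-one part
    end
    for a = 1:3
        @inbounds block[a, a] += alpha                    # plus alpha on the diagonal
    end
    return rpy
end

Here alpha and beta would be the two scalar prefactors from the Rij > 2*r and 
Rij <= 2*r branches of rotne_prager, computed before the call.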



Re: [julia-users] Trying to optimize numerical code - peculiar allocation data

2014-10-24 Thread dextorious
First of all, thank you for a very comprehensive and informative reply. I 
re-checked the performance impact of @simd/@inbounds and it seems you're 
right. The annotations do provide a measurable performance boost on longer 
calculations, but it's slightly under 2% on average and so almost 
negligible. I suspect I made other optimizations at the same time as I was 
modifying the annotations, which led me to an incorrect conclusion.

I tried using the BLAS function for the outer product and it does get 
fairly close to my custom kernel, though not quite there. It's great to see 
how easy Julia makes it to follow method calls that would be quite 
difficult to track down in other languages. I did, however, find a way to 
make my outer product function even faster by passing a preallocated 3x3 
matrix and modifying the elements to avoid repeating the zeros(3,3) call 
every time. This also reduced allocation numbers accordingly.

Using I instead of eye(3) as you suggested gave me a very considerable 
performance boost, and I hadn't seen it mentioned anywhere in the Julia docs. 
I'll have to go through the source to see how it does its magic; the 
difference is quite remarkable to me. It's also interesting that 
putting the conditional inside the assignment makes the code run faster 
compared to the other way around, but I can't argue with the results. On 
average, these two changes gave me a steady 40-50% performance boost, which 
is great.

The one thing I was still dissatisfied with was that almost a third of the 
total runtime was still spent on garbage collection, so I re-ran the 
allocation numbers and got the following results:
         - function outer_product!(a, b, n, c)
         0     for i=1:n, j=1:n
         0         c[j,i] = a[j] * b[i]
         -     end
         0     c
         - end
         - 
         - #outer_product(a, b, n) = Base.LinAlg.BLAS.ger!(1.0, a, b, zeros(n, n))
         - 
         - function rotne_prager(x, y, z, r)
         0     const p = length(x) - 1
 792000528     rpy = zeros(3*(p+1), 3*(p+1))
       880     rvec = zeros(3)
      1408     c = zeros(3,3)
1406543616     for i=1:p+1, j=1:p+1
         0         rvec[1] = x[j] - x[i]
         0         rvec[2] = y[j] - y[i]
         0         rvec[3] = z[j] - z[i]
         0         Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3])
         0         distance_ratio = r / Rij
         0         rpy[3*j-2:3*j, 3*i-2:3*i] = if (Rij > 2*r)
         0             (0.75 * distance_ratio *
         -                 ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * I +
         -                 (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product!(rvec, rvec, 3, c)))
         0         else ((1.0 - 9.0/32.0 / distance_ratio) * I +
         -                 3.0/32.0 / distance_ratio * outer_product!(rvec, rvec, 3, c))
         -         end
         -     end
         0     rpy
         - end
         - 
         - function rotne_bench(n)
         0     x = rand(n)
         0     y = rand(n)
         0     z = rand(n)
         0     r = 0.01
         0     rotne_prager(x, y, z, r);
         - end
         - 
         - rotne_bench(1000);
         - @time (for i=1:10; rotne_bench(1000); end)


The one thing that seems odd to me is the huge number next to the for loop. 
Surely the iteration variables allocated during the loop are tiny, and every 
other variable except for the two Float64s (Rij and distance_ratio) is 
preallocated. Does this mean that a large number of temporary variables are 
generated by the rpy expressions? Intuitively, it seems to me that fairly 
simple code such as this shouldn't have to spend 30% of its time in gc(), 
but perhaps I'm wrong.
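
As a back-of-the-envelope check on that suspicion: the ten timed calls to 
rotne_bench(1000) execute 10 x 1000 x 1000 = 10^7 inner iterations, so the 
1406543616 bytes charged to the loop line come to roughly 130-140 bytes per 
iteration (somewhat less if the warm-up call is counted too). That is on the 
order of one or two small heap-allocated 3x3 Float64 arrays including their 
headers, which supports the suspicion that the rpy expressions are still 
creating short-lived temporaries on every pass before the slice assignment 
copies the result into rpy.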

On Friday, October 24, 2014 7:38:36 PM UTC+3, Jiahao Chen wrote:

 Are you sure that @simd and @inbounds give you any performance gains? On 
 my laptop removing all those annotations resulted in essentially no change 
 in performance.

 Computing the outer product as a*b' is currently slow because it does not 
 use the specialized method for computing outer products. Instead, b' turns 
 the n-vector into a 1xn-row matrix, then a*b' turns a from a n-vector to a 
 nx1-column matrix, then does generic matrix-matrix multiplication (dgemm). 
 You can verify this yourself by running

 @which a*b'

 and looking at the method of * that is being dispatched on.

 As Andreas said, you can try the rank-1 update; one way to use BLAS's 
 function for this is

 outer_product(a, b, n) = Base.LinAlg.BLAS.ger!(1.0, a, b, zeros(n, n))

 but for 3x3 outer products writing your own kernel may very well be faster 
 because you avoid the cost of making a library call.

 For me the biggest speedup came from rewriting your inner kernel as

 rpy[3*j-2:3*j, 3*i-2:3*i] = if (Rij > 2*r)
     (0.75 * distance_ratio *
         ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * I +
         (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product(rvec, 

Re: [julia-users] Trying to optimize numerical code - peculiar allocation data

2014-10-24 Thread dextorious
I'm currently avoiding the use of FMMs because the systems I run are fairly 
small (colloidal soft matter, typically a few hundred spheres in a small 
number of aggregates, so each time step is quite fast, but the dynamics take 
a long time to resolve) and I'd rather not sacrifice any precision if I don't 
have to. Neighbor lists might be an interesting option to consider, however. 

I'm not entirely sure what would be gained by providing a function that 
computes the tensor's action on a vector instead of computing it directly. 
In a Stokes flow, the velocity of an arbitrary assortment of spheres can be 
approximately expressed as v = rpy * force_density, which is basically what 
I'm doing in this case. Assuming the force_density vector isn't sparse, 
wouldn't you have to compute every element of the tensor anyway?
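
To make that trade-off concrete, here is a rough sketch of what the 
matrix-free version might look like (a hypothetical rpy_apply!, not code from 
this thread), using the same two branches as rotne_prager above. Each 3x3 
block has the form a*I + b*(rvec*rvec'), so its action on the matching 
3-vector g of the force density is simply a*g + b*(rvec . g)*rvec:

function rpy_apply!(v, x, y, z, r, f)
    p = length(x) - 1
    fill!(v, 0.0)
    for i = 1:p+1, j = 1:p+1
        dx = x[j] - x[i]; dy = y[j] - y[i]; dz = z[j] - z[i]
        Rij = sqrt(dx*dx + dy*dy + dz*dz)
        dr = r / Rij
        if Rij > 2*r
            a = 0.75 * dr * (1.0 + 2.0/3.0 * dr*dr)
            b = 0.75 * dr * (1.0 - 2.0 * dr*dr)
        else
            a = 1.0 - 9.0/32.0 / dr
            b = 3.0/32.0 / dr
        end
        # the (j, i) block acts on the i-th 3-vector of f and accumulates into
        # the j-th 3-vector of v, matching rpy[3*j-2:3*j, 3*i-2:3*i] above
        f1 = f[3*i-2]; f2 = f[3*i-1]; f3 = f[3*i]
        s = dx*f1 + dy*f2 + dz*f3   # rvec . f_i
        v[3*j-2] += a*f1 + b*s*dx
        v[3*j-1] += a*f2 + b*s*dy
        v[3*j]   += a*f3 + b*s*dz
    end
    return v
end

Every block still gets evaluated, exactly as the question above anticipates, 
but nothing of size 3(p+1) x 3(p+1) is ever allocated, which is where the 
792000528 bytes charged to the rpy = zeros(...) line in the earlier profiles 
came from.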

On Friday, October 24, 2014 7:43:56 PM UTC+3, Jiahao Chen wrote:

 Without knowing much more about your specific problem and use cases, you 
 may also want to consider:

 - never computing the tensor directly, but provide an indirect 
 representation of it by providing a function that  computes its action on a 
 vector
 - using a fast multipole algorithm to determine the distance cutoff
 - reducing the number of explicit distance computations using neighbor 
 lists

 Thanks,

 Jiahao Chen
 Staff Research Scientist
 MIT Computer Science and Artificial Intelligence Laboratory

 On Fri, Oct 24, 2014 at 12:38 PM, Jiahao Chen jia...@mit.edu wrote:

 Are you sure that @simd and @inbounds give you any performance gains? On 
 my laptop removing all those annotations resulted in essentially no change 
 in performance.

 Computing the outer product as a*b' is currently slow because it does not 
 use the specialized method for computing outer products. Instead, b' turns 
 the n-vector into a 1xn-row matrix, then a*b' turns a from a n-vector to a 
 nx1-column matrix, then does generic matrix-matrix multiplication (dgemm). 
 You can verify this yourself by running

 @which a*b'

 and looking at the method of * that is being dispatched on.

 As Andreas said, you can try the rank-1 update; one way to use BLAS's 
 function for this is

 outer_product(a, b, n) = Base.LinAlg.BLAS.ger!(1.0, a, b, zeros(n, n))

 but for 3x3 outer products writing your own kernel may very well be 
 faster because you avoid the cost of making a library call.

 For me the biggest speedup came from rewriting your inner kernel as

 rpy[3*j-2:3*j, 3*i-2:3*i] = if (Rij > 2*r)
     (0.75 * distance_ratio *
         ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * I +
         (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product(rvec, rvec, 3)))
 else
     (1.0 - 9.0/32.0 / distance_ratio) * I +
         3.0/32.0 / distance_ratio * outer_product(rvec, rvec, 3)
 end

 - The I symbol is an abstract identity matrix that avoids allocating an 
 explicit matrix.
 - The compiler seems happier when I move the assignment outside the if 
 statement.

 Thanks,

 Jiahao Chen
 Staff Research Scientist
 MIT Computer Science and Artificial Intelligence Laboratory

 On Fri, Oct 24, 2014 at 11:24 AM, dexto...@gmail.com wrote:

 I'm absolutely sure that Rij > 2*r most of the time, yes. This is 
 obvious if you consider that rand(n) returns values in the range [0, 1] and 
 r = 0.01. I have also verified this empirically. I'll try to explore the 
 view update idea further, I've never used that approach before.


 On Friday, October 24, 2014 5:46:37 PM UTC+3, Andreas Noack wrote:

 Are you sure that Rij > 2*r in your calculations? 

 You should use I instead of eye(3), but probably the best solution is 
 to avoid the explicit calculation of the outer product. You can probably 
 write your problem as a rank-one update of a view of rpy.

 Med venlig hilsen

 Andreas Noack





[julia-users] Re: Simple benchmark improvement question

2014-10-02 Thread dextorious
Thanks for the quick feedback, I appreciate it. I had indeed somehow missed 
Julia's arrays being column-major and taking that into account essentially 
yields equal performance between Numba and Julia in this benchmark. I'll 
compare it to Fortran/C++ just to see how large the difference is, but the 
current performance does seem to be the limit of LLVM-based approaches in 
general. 
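
For anyone who runs into the same issue, the column-major point boils down to 
loop order. A minimal sketch of a loop-ordered Laplace/Jacobi pass 
(illustrative only, not the exact code from the pastebin links in the 
original post):

function laplace_step!(unew, u)
    n, m = size(u)
    for j = 2:m-1
        # rows in the inner loop: consecutive i are adjacent in memory,
        # so the accesses are contiguous for a column-major array
        @simd for i = 2:n-1
            @inbounds unew[i, j] = 0.25 * (u[i-1, j] + u[i+1, j] +
                                           u[i, j-1] + u[i, j+1])
        end
    end
    return unew
end

With the two loops swapped, consecutive accesses jump by a full column length 
in memory, which appears to be the kind of mistake behind the original 
factor-of-1.65 gap.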

To answer Tim Holy's point, Numba does do SIMD vectorization, but I haven't 
tested it extensively in the latest releases. For now, I'll port some of 
the more compact simulations I'm doing to Julia and Python/Numba for a less 
trivial benchmark and see how the performance compares then. 


[julia-users] Simple benchmark improvement question

2014-09-30 Thread dextorious
Greetings,

I'm a reasonably proficient user of MATLAB and Python/NumPy/SciPy doing 
computational physics. Since Julia appears to be designed to be very well 
suited to many such applications, I was curious to test its performance 
before investing much time in converting any research code. To start out, I 
wrote up the classic 2D regular finite difference Laplace benchmark in 
Julia, Python and MATLAB in both vectorized and loop versions and tested 
them all. 

The results are shown in the following Google spreadsheet (all obtained using 
a 5000x5000 grid and 100 iterations for a reasonable sample, on a Haswell 
i7-4710HQ CPU running Windows 8.1, with Julia 0.3.1, Anaconda 2.0.1 and 
MATLAB R2014a):
https://docs.google.com/spreadsheets/d/1mJ8wNiyYVszkVapRVHvRJZG9j9XQhLLPvaUJwiWrJXY/pubhtml

The code itself is published as follows:
Julia: http://pastebin.com/AAdXXYZC
Python: http://pastebin.com/5hqi9xzf

As can be clearly seen, Julia handily beats both MATLAB and basic 
Python/NumPy. It does, however, lose by a factor of 1.65 to a Numba-jitted 
version of the same Python code (obtained by simply adding a 
@jit(target='cpu') decorator on top of the appropriate function in the naive 
Python code), which compiles to the same LLVM stack that Julia uses. I 
deliberately avoided more complex JIT techniques for Python (such as using 
Pythran to compile to OpenMP-enabled C code by specifying function 
signatures) in order to stick to single-core performance only.

Given these results, and the near certainty that my Julia code is naive, 
non-idiomatic and just plain bad, I'd like to know if there's anything I 
could improve to match or (if possible) beat the JIT-compiled Python.