[julia-users] Simple benchmark improvement question

2014-09-30 Thread dextorious
Greetings,

I'm a reasonably proficient user of MATLAB and Python/NumPy/SciPy doing 
computational physics. Since Julia appears to be designed to be very well 
suited to many such applications, I was curious to test its performance 
before investing much time in converting any research code. To start out, I 
wrote up the classic 2D regular finite difference Laplace benchmark in 
Julia, Python and MATLAB in both vectorized and loop versions and tested 
them all. 

The results are shown in the following Google spreadsheet (all results were
obtained on a 5000x5000 grid, running 100 iterations for a reasonable
sample, on a Haswell i7-4710HQ CPU under Windows 8.1, using Julia 0.3.1,
Anaconda 2.0.1 and MATLAB R2014a):
https://docs.google.com/spreadsheets/d/1mJ8wNiyYVszkVapRVHvRJZG9j9XQhLLPvaUJwiWrJXY/pubhtml

The code itself is published as follows:
Julia: http://pastebin.com/AAdXXYZC
Python: http://pastebin.com/5hqi9xzf
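
In case the pastebin links go stale, the two Julia variants are essentially
the following (an illustrative sketch with my own variable names, not
necessarily character-for-character the pastebin code):

    # Loop version: one Jacobi sweep over the interior of the grid.
    function laplace_loops!(u, unew)
        nx, ny = size(u)
        for i = 2:nx-1, j = 2:ny-1
            unew[i,j] = 0.25*(u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1])
        end
    end

    # Vectorized version: the same update written with whole-array
    # slices (each slice makes a copy in Julia 0.3).
    function laplace_vectorized!(u, unew)
        unew[2:end-1,2:end-1] = 0.25*(u[1:end-2,2:end-1] + u[3:end,2:end-1] +
                                      u[2:end-1,1:end-2] + u[2:end-1,3:end])
    end

    # Benchmark driver: 5000x5000 grid, 100 sweeps, swapping buffers.
    u, unew = zeros(5000, 5000), zeros(5000, 5000)
    for iter = 1:100
        laplace_loops!(u, unew)
        u, unew = unew, u
    end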

As can be clearly seen, Julia handily beats both MATLAB and basic
Python/NumPy. It does, however, lose by a factor of 1.65 to a Numba-jitted
version of the same Python code (obtained by simply adding a
@jit(target='cpu') decorator on top of the appropriate function in the naive
Python code), which compiles through the same LLVM stack Julia uses. I
deliberately avoided more complex JIT techniques for Python (such as using
Pythran to compile to OpenMP-enabled C code by specifying function
signatures) in order to stick to single-core performance only.

Given these results, and the near certainty that my Julia code is naive,
non-idiomatic and just plain bad, I'd like to know whether there's anything
I could improve to match or (if possible) beat the JIT-compiled Python.


Re: [julia-users] Simple benchmark improvement question

2014-09-30 Thread Stefan Karpinski
Your code looks quite good – and the devectorized version avoids creating
copies of slices, which is currently one of the major performance issues
with this kind of code (to be fixed in the next major release). You can see
the inferred types of all the local variables like this:
`(@code_typed laplace_unvectorized())[1].args[2][2]` – and everything has a
concrete type, so there's nothing to improve in terms of typing.
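
Spelled out, that's just (a sketch against the Julia 0.3 AST layout; each
entry of that locals array is a (name, type, flags) triple):

    # Print the inferred type of every local variable in the method.
    # Anything showing up as Any or an abstract type would be a red flag.
    for v in (@code_typed laplace_unvectorized())[1].args[2][2]
        println(v[1], " :: ", v[2])
    end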

Adding the @inbounds annotation to the innermost for loop was one of the
first things I thought to try, but that doesn't seem to have any benefit –
I think the array accesses are all inlined and LLVM can hoist the bounds
checks out of the loop (or eliminate them entirely).
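
For reference, what I tried was simply this (sketch, reusing the loop
variables from your code):

    # Same inner loop with bounds checking disabled; here it made no
    # measurable difference, presumably because LLVM already hoists or
    # eliminates the checks.
    @inbounds for i = 2:nx-1, j = 2:ny-1
        unew[i,j] = 0.25*(u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1])
    end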

What does end up making a big difference is swapping the iteration order of
i and j: instead of `for i=2:nx-1, j=2:ny-1`, do `for j=2:ny-1, i=2:nx-1`.
NumPy arrays are row-major (C order) by default, but Julia is column-major
like Fortran, so you want the inner loop to run down a column. On my
machine, doing it in the original order is 1.61x slower, which may be
exactly the explanation of the difference you're seeing between Julia and
Numba.
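
Concretely, with the same assumed variable names as above:

    # Slow: j varies innermost, so consecutive iterations touch elements
    # a whole column (nx Float64s) apart in memory.
    for i = 2:nx-1, j = 2:ny-1
        unew[i,j] = 0.25*(u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1])
    end

    # Fast: i varies innermost, walking down a column through adjacent
    # memory locations.
    for j = 2:ny-1, i = 2:nx-1
        unew[i,j] = 0.25*(u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1])
    end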
