Re: [julia-users] Re: A Very Simple Benchmark for Brutal-force Loops in Several Languages: revised, Julia is fast!
Since I contributed the Numba JIT timing earlier in the thread, it seems only fair to note that the modified Julia version with the properly preallocated data is now 17% faster than the Numba version on my computer. Overall, this supports my thesis that good Julia code is on par with or slightly faster than Python code using the JIT libraries (Numba, Hope, Parakeet, etc.), with the added benefit of not requiring external libraries. I've also seen some cases where the difference is even more pronounced in Julia's favor, because of the ability to use @simd to prod the LLVM vectorizer along and the ability to exert precise control over memory allocations, but this is highly use-case specific.

On Tuesday, July 26, 2016 at 12:47:53 AM UTC+3, Tim Holy wrote:
>
> Given the apparent interest in the topic and the decisions that people seem
> to be making, it seems worth pointing out that folks are still using
> apples-to-oranges comparisons on this benchmark.
>
> There are at least two important differences:
> - In the other languages, `linspace` allocates a vector, but in Julia it
>   (currently) creates a compact object from which values are computed
>   on-the-fly. That computation involves a division, and division is slow.
> - Languages like C aren't doing bounds-checking. You might imagine adding
>   `@inbounds` to the Julia version. But `@inbounds` has no impact on LinSpace
>   objects in Julia 0.4. In Julia 0.5, it does work like you'd expect (thanks,
>   Blake Johnson).
>
> Combining these observations, we can `collect` the values into a `Vector`
> and then use `@inbounds`.
> For me the version below is nearly twice as fast as the original:
>
> function benchmark()
>     nsamples = 100
>     x = collect(linspace(0, 5, nsamples))
>     y = zeros(nsamples)
>     # attempt to trigger JIT to compile all functions needed in the loops
>     # before profiling
>     a = cos(0.0); a = 1.5*2.5; a = 1.5+2.5;
>     println("\nBrutal-force loops, 100 times:")
>     @time begin
>         for m = 1:100
>             @inbounds for n = 1:nsamples
>                 y[n] = cos(2*x[n]+5);
>             end
>         end
>     end
> end
> benchmark();
>
> Best,
> --Tim
>
> On Monday, July 25, 2016 2:02:41 PM CDT Zhong Pan wrote:
> > Agree that while raw speed is important, in most situations it wouldn't be
> > the most important reason to choose one programming language over another.
> >
> > I came from the angle of an engineer in a small company. For myself, the
> > main attraction of Julia was the easiness of achieving decent speed without
> > making much explicit effort: that means what feels more natural vectorized
> > will be vectorized, while what feels more natural in a loop will be in a
> > loop; that means I don't need to resort to another language or a library
> > only for improving speed; and that means, apart from sticking to a couple
> > of good habits, I can use objects, functions, etc. the same way inside a
> > loop vs. outside. None of these is critical by itself, but they add up to
> > an uninterrupted flow of thoughts while writing code to explore, try,
> > fail, and retry, for many iterations.
> >
> > During this "less careful" prototyping, a 1-2x slowdown is fine, but with
> > Julia I know I won't sit there for tens of minutes waiting for a result
> > while debating with myself whether I should rewrite it in C++ or rehaul
> > the code with Cython etc.; instead I can rest assured that as long as my
> > algo and coding have no mistakes or major flaws, the speed is close to
> > what I will get even if I make several times more effort to rewrite it
> > in C++.
> > Another big deal for me is the resulting removal of the barrier between
> > prototype and production code. For production I can review and improve my
> > code carefully, but rewriting it in a less expressive language is too much.
> >
> > I was a huge fan of Python (heck, I even persuaded my previous boss, a VP,
> > to pick up Python - though I don't know if he really had time to finish
> > it. :-)). However, the slow raw speed and the over-freedom to change class
> > definitions anywhere always gave me the itch to find something better. My
> > brother at JPL who worked on Python projects also complained about having
> > to think really hard to vectorize almost everything, and then couldn't
> > easily understand what he was doing a few months later because the code
> > was too unnatural for the problem; the indentation was also a big headache
> > as collaborators use different editors with different tab definitions.
> >
> > So I'm really happy to have found Julia, which gave me the same joy as
> > coding in Python and removed the main itches.
> >
> > -Zhong
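Going back to Tim's linspace point above: Julia's lazy range computes each element with a division on every access, while a collected vector pays that cost once up front. The difference can be sketched in Python terms; `LazyLinspace` below is a hypothetical illustration of the idea, not Julia's actual implementation:

```python
# Sketch of the lazy-vs-collected distinction Tim describes. A lazy range
# object performs a division on every element access, while "collecting"
# it into a plain list pays the division cost once and turns each later
# access into a simple memory read.

class LazyLinspace:
    """Computes values on the fly, like Julia 0.4's LinSpace (illustrative)."""
    def __init__(self, start, stop, n):
        self.start, self.stop, self.n = start, stop, n

    def __getitem__(self, i):
        # A division happens on every single access.
        return self.start + i * (self.stop - self.start) / (self.n - 1)

def collect(lazy):
    """Materialize into a plain list, like Julia's collect(linspace(...))."""
    return [lazy[i] for i in range(lazy.n)]

lazy = LazyLinspace(0.0, 5.0, 100)
eager = collect(lazy)

# Both views agree element for element; only where the division cost is
# paid differs.
assert all(abs(lazy[i] - eager[i]) < 1e-12 for i in range(100))
```

The benchmark loop then indexes the collected vector 100 times per element, which is why hoisting the divisions out of the loop matters.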
[julia-users] Re: A Very Simple Benchmark for Brutal-force Loops in Several Languages: revised, Julia is fast!
I haven't done any systematic benchmarking since Numba introduced the ability to JIT-compile entire classes. In my experience, very well-written Julia code is usually equivalent to or better than (in cases when @simd is helpful) Numba-JIT'd code. The Python code is sometimes easier to write, since Numba takes care of everything, but it's a double-edged sword: if you run into a case where Numba doesn't work well or at all, you're just out of luck.

In my personal view, the availability of Numba and other libraries means Python-vs-Julia performance comparisons aren't particularly relevant: you should pick the language you prefer, not the one you think is faster, and if absolute performance is the only metric, you have to resort to Fortran/C++ anyway. I'm writing most of my code in Julia because I prefer the type system and multiple dispatch to Python's OOP, but I don't think it's meaningfully faster as long as you're willing to use the appropriate libraries. The one outlier is the fact that GPU programming is much easier in Python at the moment. Hopefully that will change soon; some of the progress in ArrayFire.jl and the CUDA libraries is very promising.

On Monday, July 25, 2016 at 7:34:04 PM UTC+3, dnm wrote:
>
> Interesting. Did you use the updated Julia code?
>
> Have you done any comparisons between reading and writing Numba JIT classes
> and Julia types in tight loops?
>
> On Monday, July 25, 2016 at 10:41:48 AM UTC-4, dexto...@gmail.com wrote:
>>
>> Just for the sake of comprehensiveness, I ran your Python benchmark
>> through the Numba JIT library (which uses the same underlying LLVM
>> infrastructure that Julia does) and on my computer the Python code is
>> faster than Julia by 68%. Vanilla CPython is terrible for this kind of
>> simple explicit loop code, but Numba and other JIT libraries largely solve
>> that issue with minimal effort as long as the code is simple enough.
>> That by no means solves all of Python's issues in the context of numerical
>> programming, and I'm sure the Julia benchmark could be improved as others
>> have already mentioned, but benchmarking Python this way isn't necessarily
>> representative of how a performance-conscious programmer would reasonably
>> approach a problem of this kind.
[julia-users] Re: A Very Simple Benchmark for Brutal-force Loops in Several Languages: revised, Julia is fast!
Just for the sake of comprehensiveness, I ran your Python benchmark through the Numba JIT library (which uses the same underlying LLVM infrastructure that Julia does) and on my computer the Python code is faster than Julia by 68%. Vanilla CPython is terrible for this kind of simple explicit loop code, but Numba and other JIT libraries largely solve that issue with minimal effort as long as the code is simple enough. That by no means solves all of Python's issues in the context of numerical programming and I'm sure the Julia benchmark could be improved as others have already mentioned, but benchmarking Python this way isn't necessarily representative of how a performance-conscious programmer would reasonably approach a problem of this kind.
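The kernel in question is exactly the shape Numba handles well: an explicit element-wise loop over a NumPy array. A minimal sketch of that pattern follows; the no-op fallback decorator (so the code also runs where numba isn't installed) and the exact kernel body are my assumptions, not code from the thread:

```python
import math
import numpy as np

# A no-op fallback keeps this sketch runnable even without numba installed;
# when numba is present, njit compiles the loop to machine code via LLVM.
try:
    from numba import njit
except ImportError:
    def njit(f):
        return f

@njit
def kernel(x, y):
    # Explicit element-wise loop: very slow in vanilla CPython,
    # compiled to a tight native loop by Numba's JIT.
    for n in range(x.shape[0]):
        y[n] = math.cos(2.0 * x[n] + 5.0)

x = np.linspace(0.0, 5.0, 100)
y = np.zeros(100)
kernel(x, y)
assert abs(y[0] - math.cos(5.0)) < 1e-12
```

The first call pays the compilation cost, which is why benchmarks of this kind should time a warm second call.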
[julia-users] Re: Can you make this linear algebra code faster?
I hope to look at this when I get some time, but as a preliminary note, merely applying the @inbounds and @simd macros to the main for loop yields an increase in performance of about 15-20% on my machine.
[julia-users] Trying to optimize numerical code - peculiar allocation data
So, I've been gradually porting some of my more performance-sensitive MATLAB research code to Julia, with great results so far. One of my current projects spends about 80% of the computation time calculating the Rotne-Prager-Yamakawa tensor at each time step, so I'm quite motivated to optimize that particular step of the computation as much as possible. Rewriting the code in Julia alone resulted in an order-of-magnitude speedup. Writing my own implementation of the dyadic product of two vectors gave me another factor of 2 (which I found rather surprising, but still), as did heavy use of @simd and @inbounds annotations where appropriate.

Despite these successes, I still feel like there is more performance to be gained here. For one, the code still spends a lot of the time on garbage collection. Also, running julia with --track-allocation=user gave me some perplexing results. First things first, the relevant code:

function outer_product(a, b, n)
    c = zeros(n, n)
    @simd for i = 1 : n
        @simd for j = 1 : n
            @inbounds c[i, j] = a[i] * b[j]
        end
    end
    c
end

function rotne_prager(x, y, z, r)
    const p = length(x) - 1
    rpy = zeros(3*(p+1), 3*(p+1))
    rvec = zeros(3)
    @simd for i = 1 : p+1
        @simd for j = 1 : p+1
            @inbounds rvec[1] = x[j] - x[i]
            @inbounds rvec[2] = y[j] - y[i]
            @inbounds rvec[3] = z[j] - z[i]
            Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3])
            distance_ratio = r / Rij
            if (Rij > 2*r)
                @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 * distance_ratio *
                    ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * eye(3) +
                     (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product(rvec, rvec, 3)))
            else
                @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (
                    (1.0 - 9.0/32.0 / distance_ratio) * eye(3) +
                    3.0/32.0 / distance_ratio * outer_product(rvec, rvec, 3))
            end
        end
    end
    rpy
end

function rotne_bench(n)
    x = rand(n)
    y = rand(n)
    z = rand(n)
    r = 0.01
    rotne_prager(x, y, z, r);
end

rotne_bench(1000);
@time (for i=1:10; rotne_bench(1000); end)

Running this yields the following output on one of my machines:
elapsed time: 13.440131479 seconds (8718927056 bytes allocated, 36.99% gc time)

I then tried to add proper type annotations everywhere, but this did not change anything in the results, so I'm assuming Julia's type-inference system is doing a great job here. Next, I tried to look at the allocations using the --track-allocation=user switch (by the way, this did not seem to produce a *.jl.mem file anywhere on my Windows machines, but worked fine on a Linux box; is this a known behavior?). This yielded the following output:

        - function outer_product(a, b, n)
 70967296     c = zeros(n, n)
        0     @simd for i = 1 : n
        0         @simd for j = 1 : n
        0             @inbounds c[i, j] = a[i] * b[j]
        -         end
        -     end
        0     c
        - end
        -
        - function rotne_prager(x, y, z, r)
        0     const p = length(x) - 1
792000528     rpy = zeros(3*(p+1), 3*(p+1))
      880     rvec = zeros(3)
        0     @simd for i = 1 : p+1
        0         @simd for j = 1 : p+1
        0             @inbounds rvec[1] = x[j] - x[i]
        0             @inbounds rvec[2] = y[j] - y[i]
        0             @inbounds rvec[3] = z[j] - z[i]
        0             Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3])
        0             distance_ratio = r / Rij
        0             if (Rij > 2*r)
        0                 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (0.75 * distance_ratio *
        -                     ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * eye(3) +
        -                      (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product(rvec, rvec, 3)))
        -             else
279582208                 @inbounds rpy[3*j-2:3*j, 3*i-2:3*i] = (
        -                     (1.0 - 9.0/32.0 / distance_ratio) * eye(3) +
        -                     3.0/32.0 / distance_ratio * outer_product(rvec, rvec, 3))
        -             end
        -         end
        -     end
        0     rpy
        0 end
        -
        0 function rotne_bench(n)
        0     x = rand(n)
        0     y = rand(n)
        0     z = rand(n)
        0     r = 0.01
        0     rotne_prager(x, y, z, r);
        - end
        -
        - rotne_bench(1000);
        0 @time (for i=1:10; rotne_bench(1000); end)

I find myself somewhat confused by these results, specifically line 27. Why would the second expression constructing the rpy matrix cause a huge number of allocations when the first did not? They seem virtually identical from a programming point of view, with minor differences in numerical coefficients.
I feel like, in my limited understanding of a
Re: [julia-users] Trying to optimize numerical code - peculiar allocation data
I'm absolutely sure that Rij > 2*r most of the time, yes. This is obvious if you consider that rand(n) returns values in the range [0, 1] and r = 0.01. I have also verified this empirically. I'll try to explore the view update idea further; I've never used that approach before.

On Friday, October 24, 2014 5:46:37 PM UTC+3, Andreas Noack wrote:
> Are you sure that Rij > 2*r in your calculations? You should use I instead
> of eye(3), but probably the best solution is to avoid the explicit
> calculation of the outer product. You can probably write your problem as a
> rank-one update of a view of rpy.
>
> Med venlig hilsen
> Andreas Noack
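The claim is easy to check numerically: with coordinates drawn uniformly from [0, 1] and r = 0.01, the fraction of pairs with Rij > 2*r is overwhelmingly close to one. A quick standard-library sketch (the point count and seed are arbitrary choices, not from the thread):

```python
import random

# Empirical check: with positions uniform in [0, 1]^3 and r = 0.01, nearly
# every pair of spheres satisfies Rij > 2*r, so the far-field branch of the
# RPY tensor dominates.
random.seed(42)
n, r = 200, 0.01
pts = [(random.random(), random.random(), random.random()) for _ in range(n)]

far = total = 0
for i in range(n):
    for j in range(i + 1, n):
        dx = pts[i][0] - pts[j][0]
        dy = pts[i][1] - pts[j][1]
        dz = pts[i][2] - pts[j][2]
        Rij = (dx*dx + dy*dy + dz*dz) ** 0.5
        total += 1
        if Rij > 2 * r:
            far += 1

# The far-field fraction should be very close to 1.
assert far / total > 0.99
```

This also means branch prediction makes the `if` nearly free; the cost is in the arithmetic and allocations inside each branch.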
Re: [julia-users] Trying to optimize numerical code - peculiar allocation data
First of all, thank you for a very comprehensive and informative reply. I re-checked the performance impact of @simd/@inbounds and it seems you're right. The annotations do provide a measurable performance boost on longer calculations, but it's slightly under 2% on average and so almost negligible. I suspect I must have made other optimizations while modifying the annotations previously, leading to an incorrect conclusion.

I tried using the BLAS function for the outer product and it does get fairly close to my custom kernel, though not quite there. It's great to see how easy Julia makes it to follow method calls that would be quite difficult to track down in other languages. I did, however, find a way to make my outer product function even faster by passing in a preallocated 3x3 matrix and modifying its elements, which avoids repeating the zeros(3,3) call every time. This also reduced the allocation numbers accordingly.

Using I instead of eye(3), as you suggested, gave me a very considerable performance boost; I hadn't seen it mentioned anywhere in the Julia docs. I'll have to go through the source to see how it does its magic - the difference is quite remarkable to me. It's also interesting that putting the conditional inside the assignment makes the code run faster compared to the other way around, but I can't argue with the results. On average, these two changes gave me a steady 40-50% performance boost, which is great.
The one thing I was still dissatisfied with was that almost a third of the total runtime was still spent on garbage collection, so I re-ran the allocation numbers and got the following results:

         - function outer_product!(a, b, n, c)
         0     for i=1:n, j=1:n
         0         c[j,i] = a[j] * b[i]
         -     end
         0     c
         - end
         -
         - #outer_product(a, b, n) = Base.LinAlg.BLAS.ger!(1.0, a, b, zeros(n, n))
         -
         - function rotne_prager(x, y, z, r)
         0     const p = length(x) - 1
 792000528     rpy = zeros(3*(p+1), 3*(p+1))
       880     rvec = zeros(3)
      1408     c = zeros(3,3)
1406543616     for i=1:p+1, j=1:p+1
         0         rvec[1] = x[j] - x[i]
         0         rvec[2] = y[j] - y[i]
         0         rvec[3] = z[j] - z[i]
         0         Rij = sqrt(rvec[1]*rvec[1] + rvec[2]*rvec[2] + rvec[3]*rvec[3])
         0         distance_ratio = r / Rij
         0         rpy[3*j-2:3*j, 3*i-2:3*i] = if (Rij > 2*r)
         0             (0.75 * distance_ratio *
         -                 ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * I +
         -                  (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product!(rvec, rvec, 3, c)))
         0         else
         -             ((1.0 - 9.0/32.0 / distance_ratio) * I +
         -              3.0/32.0 / distance_ratio * outer_product!(rvec, rvec, 3, c))
         -         end
         -     end
         0     rpy
         - end
         -
         - function rotne_bench(n)
         0     x = rand(n)
         0     y = rand(n)
         0     z = rand(n)
         0     r = 0.01
         0     rotne_prager(x, y, z, r);
         - end
         -
         - rotne_bench(1000);
         - @time (for i=1:10; rotne_bench(1000); end)

The one thing that seems odd to me is the huge number next to the for loop. Surely the iterator variables allocated during the loop are tiny, and every other variable except for the two Float64s (Rij and distance_ratio) is preallocated. Does this mean that a large number of temporary variables are generated during the rpy expressions? Intuitively, it seems to me that fairly simple code such as this shouldn't have to spend 30% of its time on gc(), but perhaps I'm wrong.

On Friday, October 24, 2014 7:38:36 PM UTC+3, Jiahao Chen wrote:
> Are you sure that @simd and @inbounds give you any performance gains? On my
> laptop removing all those annotations resulted in essentially no change in
> performance.
> Computing the outer product as a*b' is currently slow because it does not
> use the specialized method for computing outer products. Instead, b' turns
> the n-vector into a 1xn row matrix, then a*b' turns a from an n-vector into
> an nx1 column matrix, then does generic matrix-matrix multiplication (dgemm).
> You can verify this yourself by running @which a*b' and looking at the
> method of * that is being dispatched on.
>
> As Andreas said, you can try the rank-1 update; one way to use BLAS's
> function for this is
>
> outer_product(a, b, n) = Base.LinAlg.BLAS.ger!(1.0, a, b, zeros(n, n))
>
> but for 3x3 outer products writing your own kernel may very well be faster
> because you avoid the cost of making a library call.
>
> For me the biggest speedup came from rewriting your inner kernel as
>
> rpy[3*j-2:3*j, 3*i-2:3*i] = if (Rij > 2*r)
>     (0.75 * distance_ratio *
>         ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * I +
>          (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product(rvec,
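Regarding the huge allocation count next to the for loop in the listing above: compound array expressions like c1*I + c2*M build a fresh temporary for every scaling and every sum, and the slice assignment then copies the result into the destination block. The same effect is easy to see in NumPy terms (an analogy, not Julia's allocator; the coefficients are illustrative):

```python
import numpy as np

# Analogy for the per-iteration temporaries: c1*I3 + c2*M allocates two
# scaled arrays plus their sum each time it is evaluated. Filling a single
# preallocated buffer in place produces the same values with no fresh
# result array per iteration.
I3 = np.eye(3)
M = np.arange(9.0).reshape(3, 3)
c1, c2 = 2.0, 0.5

# Allocating form: several temporary 3x3 arrays per evaluation.
block = c1 * I3 + c2 * M

# In-place form: one reusable buffer.
buf = np.empty((3, 3))
np.multiply(M, c2, out=buf)   # buf = c2 * M, written in place
idx = np.arange(3)
buf[idx, idx] += c1           # add c1 along the diagonal (the c1*I term)

assert np.allclose(buf, block)
```

In a tight loop over ~10^6 blocks, those small temporaries add up to exactly the kind of garbage-collection pressure reported above.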
Re: [julia-users] Trying to optimize numerical code - peculiar allocation data
I'm currently avoiding the use of FMMs because the systems I run are fairly small (colloidal soft matter, typically involving a few hundred spheres in a small number of aggregates, so each time step is quite fast, but the dynamics take a long time to resolve) and I'd rather not sacrifice any precision if I don't have to. Neighbor lists might be an interesting option to consider, however.

I'm not entirely sure what would be gained by providing a function that computes the tensor's action on a vector instead of computing it directly. In a Stokes flow, the velocity of an arbitrary assortment of spheres can be approximately expressed as v = rpy * force_density, which is basically what I'm doing in this case. Assuming the force_density vector isn't sparse, wouldn't you have to compute every element of the tensor anyway?

On Friday, October 24, 2014 7:43:56 PM UTC+3, Jiahao Chen wrote:
> Without knowing much more about your specific problem and use cases, you may
> also want to consider:
>
> - never computing the tensor directly, but providing an indirect
>   representation of it via a function that computes its action on a vector
> - using a fast multipole algorithm to determine the distance cutoff
> - reducing the number of explicit distance computations using neighbor lists
>
> Thanks,
>
> Jiahao Chen
> Staff Research Scientist
> MIT Computer Science and Artificial Intelligence Laboratory
>
> On Fri, Oct 24, 2014 at 12:38 PM, Jiahao Chen <jia...@mit.edu> wrote:
>> Are you sure that @simd and @inbounds give you any performance gains? On my
>> laptop removing all those annotations resulted in essentially no change in
>> performance.
>>
>> Computing the outer product as a*b' is currently slow because it does not
>> use the specialized method for computing outer products. Instead, b' turns
>> the n-vector into a 1xn row matrix, then a*b' turns a from an n-vector into
>> an nx1 column matrix, then does generic matrix-matrix multiplication
>> (dgemm).
>> You can verify this yourself by running @which a*b' and looking at the
>> method of * that is being dispatched on.
>>
>> As Andreas said, you can try the rank-1 update; one way to use BLAS's
>> function for this is
>>
>> outer_product(a, b, n) = Base.LinAlg.BLAS.ger!(1.0, a, b, zeros(n, n))
>>
>> but for 3x3 outer products writing your own kernel may very well be faster
>> because you avoid the cost of making a library call.
>>
>> For me the biggest speedup came from rewriting your inner kernel as
>>
>> rpy[3*j-2:3*j, 3*i-2:3*i] = if (Rij > 2*r)
>>     (0.75 * distance_ratio *
>>         ((1.0 + 2.0/3.0 * distance_ratio*distance_ratio) * I +
>>          (1.0 - 2.0*distance_ratio*distance_ratio) * outer_product(rvec, rvec, 3)))
>> else
>>     (1.0 - 9.0/32.0 / distance_ratio) * I +
>>     3.0/32.0 / distance_ratio * outer_product(rvec, rvec, 3)
>> end
>>
>> - The I symbol is an abstract identity matrix that avoids allocating an
>>   explicit matrix.
>> - The compiler seems happier when I move the assignment outside the if
>>   statement.
>>
>> Thanks,
>>
>> Jiahao Chen
>> Staff Research Scientist
>> MIT Computer Science and Artificial Intelligence Laboratory
>>
>> On Fri, Oct 24, 2014 at 11:24 AM, <dexto...@gmail.com> wrote:
>>> I'm absolutely sure that Rij > 2*r most of the time, yes. This is obvious
>>> if you consider that rand(n) returns values in the range [0, 1] and
>>> r = 0.01. I have also verified this empirically. I'll try to explore the
>>> view update idea further; I've never used that approach before.
>>>
>>> On Friday, October 24, 2014 5:46:37 PM UTC+3, Andreas Noack wrote:
>>>> Are you sure that Rij > 2*r in your calculations? You should use I
>>>> instead of eye(3), but probably the best solution is to avoid the
>>>> explicit calculation of the outer product. You can probably write your
>>>> problem as a rank-one update of a view of rpy.
>>>>
>>>> Med venlig hilsen
>>>> Andreas Noack
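On the "action on a vector" question: for the RPY blocks specifically, the structure c1*I + c2*outer(rvec, rvec) means each block's product with a vector reduces to c1*v + c2*(rvec . v)*rvec, so the 3x3 block never needs to be materialized. This saves memory traffic and allocations rather than arithmetic, which is consistent with the skepticism above for dense force densities. A NumPy sketch of the identity, with illustrative coefficients:

```python
import numpy as np

# Matrix-free action of one RPY-style block: since the block has the form
# c1*I + c2*outer(rvec, rvec), its product with a vector v is
# c1*v + c2*(rvec . v)*rvec, computed without forming the 3x3 matrix.
rvec = np.array([0.3, -0.4, 0.5])
v = np.array([1.0, 2.0, 3.0])
c1, c2 = 0.75, 0.25

explicit = (c1 * np.eye(3) + c2 * np.outer(rvec, rvec)) @ v
matrix_free = c1 * v + c2 * np.dot(rvec, v) * rvec

# Both routes give the same result.
assert np.allclose(explicit, matrix_free)
```

For a few hundred spheres this is mostly an allocation win; the flop count per pair is essentially unchanged.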
[julia-users] Re: Simple benchmark improvement question
Thanks for the quick feedback, I appreciate it. I had indeed somehow missed that Julia's arrays are column-major, and taking that into account essentially yields equal performance between Numba and Julia in this benchmark. I'll compare it to Fortran/C++ just to see how large the difference is, but the current performance does seem close to the limit of LLVM-based approaches in general. To answer Tim Holy's point, Numba does do SIMD vectorization, but I haven't tested it extensively in the latest releases. For now, I'll port some of the more compact simulations I'm doing to Julia and Python/Numba for a less trivial benchmark and see how the performance compares then.
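For reference, the layout point generalizes across the languages in this thread: Julia and MATLAB store arrays column-major (Fortran order), while NumPy defaults to row-major (C order), so the fast, contiguous index is the first one in Julia and the last one in NumPy. A small NumPy sketch of the two layouts (the stride values assume 8-byte floats):

```python
import numpy as np

# Julia/MATLAB are column-major (Fortran order); NumPy defaults to row-major
# (C order). Inner loops should run along the contiguous axis: last axis in
# C order, first axis in Fortran order.
a_c = np.zeros((4, 5))            # C order: each row is contiguous
a_f = np.asfortranarray(a_c)      # Fortran order: each column is contiguous

assert a_c.flags['C_CONTIGUOUS']
assert a_f.flags['F_CONTIGUOUS']

# Strides (bytes to step per axis) make the difference concrete:
assert a_c.strides == (5 * 8, 8)  # stepping along a row moves 8 bytes
assert a_f.strides == (8, 4 * 8)  # stepping down a column moves 8 bytes
```

Swapping the loop nest order to match the layout is exactly the fix that closed the gap in this benchmark.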
[julia-users] Simple benchmark improvement question
Greetings, I'm a reasonably proficient user of MATLAB and Python/NumPy/SciPy doing computational physics. Since Julia appears to be designed to be very well suited to many such applications, I was curious to test its performance before investing much time in converting any research code.

To start out, I wrote up the classic 2D regular finite-difference Laplace benchmark in Julia, Python and MATLAB, in both vectorized and loop versions, and tested them all. The results are shown in the following Google spreadsheet (all results obtained using a 5000x5000 grid and doing 100 iterations for a reasonable sample, on a Haswell i7-4710HQ CPU running Windows 8.1, with Julia 0.3.1, Anaconda 2.0.1 and MATLAB R2014a):

https://docs.google.com/spreadsheets/d/1mJ8wNiyYVszkVapRVHvRJZG9j9XQhLLPvaUJwiWrJXY/pubhtml

The code itself is published as follows:
Julia: http://pastebin.com/AAdXXYZC
Python: http://pastebin.com/5hqi9xzf

As can be clearly seen, Julia does handily beat both MATLAB and basic Python/NumPy. It does, however, lose by a factor of 1.65 to a Numba-jitted version of the same Python code (obtained by simply adding a @jit(target='cpu') decorator on top of the appropriate function in the naive Python code), which compiles to the same LLVM stack Julia uses. I deliberately avoided more complex JIT techniques for Python (such as using Pythran to compile to OpenMP-enabled C code by specifying function signatures) to stick to single-core performance only. Given these results, and the virtual certainty that my Julia code is naive, non-idiomatic and just plain bad, I'd like to know if there's anything I could improve to match or (if possible) beat the JIT-compiled Python.
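Since the benchmarked source lives only at the pastebin links, here is a minimal sketch of what a vectorized Jacobi sweep for the 2D Laplace problem typically looks like; this is an assumption about the benchmark's structure, not the actual posted code:

```python
import numpy as np

# Minimal sketch of the vectorized Jacobi update for a 2D Laplace problem:
# each interior point is replaced by the mean of its four neighbors, with
# the boundary rows and columns held fixed.
def laplace_step(u):
    new = u.copy()
    new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                              u[1:-1, :-2] + u[1:-1, 2:])
    return new

# Tiny demo grid with a hot top edge; repeated sweeps smooth the interior.
u = np.zeros((6, 6))
u[0, :] = 1.0
for _ in range(50):
    u = laplace_step(u)

# Interior values end up strictly between the boundary extremes.
assert 0.0 < u[3, 3] < 1.0
```

The loop version replaces the slice arithmetic with an explicit double loop over the interior points, which is the variant where JIT compilation (Julia or Numba) makes the decisive difference.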