Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-14 Thread gael . mcdon


 I wonder if we should provide access to DSFMT's random array generation, 
 so that one can use an array generator. The requirements are that one has 
 to generate at least 384 random numbers at a time, and the size of the 
 array must be even. 


 We should not allow this with the global seed, and it can be exposed through 
 a randarray!() function. We can even avoid exporting this function by 
 default, since it imposes several conditions, but it gives really 
 high performance.

Are the required conditions limited to n ≥ 384 and n even?

Why not provide it by default then, with a single if statement to check 
for the n ≥ 384 condition? The n-even condition is not really a problem, as 
Julia does not allocate exactly the amount of data requested. Even for a 
fixed-size array, adding one extra (not user-accessible) element does not 
seem to be much of a drawback.
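The fill-a-preallocated-array pattern under discussion can be sketched as follows. Note that randarray!() is only a hypothetical name from this thread; Base's rand! (which exists in Julia 0.3) already shows the in-place shape such an API would build on:

```julia
# Sketch only: `randarray!` from the discussion above is hypothetical.
# Base's `rand!` fills a preallocated array in place, avoiding a fresh
# allocation on every call; batched DSFMT generation would plug in below it.
A = Array(Float64, 384, 1000)  # preallocate once (Julia 0.3 syntax)
rand!(A)                       # refill A in place on each use
```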

 


 -viral



Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-13 Thread Viral Shah
I wonder if we should provide access to DSFMT's random array generation, so 
that one can use an array generator. The requirements are that one has to 
generate at least 384 random numbers at a time, and the size of the 
array must be even.

We should not allow this with the global seed, and it can be exposed through 
a randarray!() function. We can even avoid exporting this function by 
default, since it imposes several conditions, but it gives really 
high performance.

-viral

On Tuesday, 9 September 2014 at 17:06:59 UTC+2, Ján Dolinský wrote:

 Hello Andreas,

 Thanks for the tip. I'll check it out. Thumbs up for 0.4!

 Jan

 On 09.09.2014 17:04, Andreas Noack wrote:

 If you need the speed now, you can try one of the packages ArrayViews or 
 ArrayViewsAPL. It is functionality similar to that in these packages that 
 we are trying to include in Base.

 Best regards,

 Andreas Noack
Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-12 Thread Ján Dolinský
Finally, I found that Octave has an equivalent of sumabs2() called sumsq(). 
Just for the sake of completeness, here are the timings:

Octave
X = rand(7000);
tic; sumsq(X); toc;
Elapsed time is 0.0616651 seconds.

Julia v0.3
@time X = rand(7000,7000);
elapsed time: 0.285218597 seconds (392000160 bytes allocated)
@time sumabs2(X, 1);
elapsed time: 0.05705666 seconds (56496 bytes allocated)


Essentially, speed is about the same, with Julia being a little faster.

It was, however, interesting to observe that X = rand(7000,7000) is about 
2.5 times slower in Julia 0.3 than it was in Julia 0.2 ...

in Julia (v0.2.1):
 @time X = rand(7000,7000);
elapsed time: 0.114418731 seconds (392000128 bytes allocated)
 

Jan
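The quantity being benchmarked above can be reproduced with a few lines (Julia 0.3-era spelling; the note about later versions is an aside, not from this thread):

```julia
# Column-wise sums of squares, the Julia counterpart of Octave's sumsq(X).
X = rand(7000, 7000)
d = sumabs2(X, 1)   # 1x7000 array of per-column sums of squared elements
# In later Julia versions this spelling became: d = sum(abs2, X, dims=1)
```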


Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-12 Thread Andreas Noack
I think the reason for the slowdown in rand since 0.2.1 is this:

https://github.com/JuliaLang/julia/pull/6581

Right now we are filling the array one element at a time, which is not
efficient, but unfortunately it is our best option at the moment. In
applications where you draw one random variate at a time, there shouldn't be
a difference.

Best regards,

Andreas Noack


Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-12 Thread Ján Dolinský
Yes, 6581 sounds like it. Thanks for the clarification.

Jan 

[julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-09 Thread Ján Dolinský
Hello guys,

Thanks a lot for the lengthy discussions. It helped me a lot to get a 
feeling for what Julia is like. I did some more performance comparisons as 
suggested by the first two posts (thanks a lot for the tips). In the 
meantime I upgraded to v0.3.
X = rand(7000,7000);
@time d = sum(X.^2, 1);
elapsed time: 0.573125833 seconds (392056672 bytes allocated, 2.25% gc time)
@time d = sum(X.*X, 1);
elapsed time: 0.178715901 seconds (392057080 bytes allocated, 14.06% gc time)
@time d = sumabs2(X, 1);
elapsed time: 0.067431808 seconds (56496 bytes allocated)

In Octave then
X = rand(7000);
tic; d = sum(X.^2); toc;
Elapsed time is 0.167578 seconds.

So the ultimate solution is the sumabs2 function, which is a blast. I am 
coming from Matlab/Octave, and I would expect X.^2 to be fast out of the 
box, but nevertheless, if I can get excellent performance by learning 
some new paradigms, I will go for it.

The above tests led me to another question. I often need to calculate the 
self dot product over a portion of a matrix, e.g.
@time d = sumabs2(X[:,1001:end], 1);
elapsed time: 0.17566 seconds (336048688 bytes allocated, 7.01% gc time)

Apparently this is not the way to do it in Julia, because working on a smaller 
matrix of 7000x6000 gives more than double the computing time, and furthermore 
it seems to allocate unnecessary memory.

Best Regards,
Jan
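The copy-free version of this slice-then-reduce question can be sketched as follows, assuming Julia 0.3-era APIs (the reduction over a SubArray may not hit the fastest code path in 0.3, but it avoids the 336 MB copy):

```julia
# Reduce over a column range without materializing the 7000x6000 copy.
X = rand(7000, 7000)

# `sub` (Base, Julia 0.3) returns a lightweight SubArray view:
d = sumabs2(sub(X, :, 1001:size(X, 2)), 1)

# The ArrayViews package mentioned in this thread offers a similar `view`:
# using ArrayViews
# d = sumabs2(view(X, :, 1001:size(X, 2)), 1)
```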



On Monday, 8 September 2014 at 10:36:02 UTC+2, Ján Dolinský wrote:

 Hello,

 I am a new Julia user. I am trying to write a function to compute the 
 self dot product of all columns in a matrix, i.e. squaring each element of 
 the matrix and computing a column-wise sum. I am interested in the proper 
 way of doing it because I often need to process large matrices.

 I first put the focus on calculating the squares. For testing purposes I use 
 a matrix of random floats of size 7000x7000. All timings here were taken 
 after several repeated runs.

 I used to do it in Octave (v3.8.1) as follows:
 tic; X = rand(7000); toc;
 Elapsed time is 0.579093 seconds.
 tic; XX = X.^2; toc;
 Elapsed time is 0.114737 seconds.


 I tried to do the same in Julia (v0.2.1):
 @time X = rand(7000,7000);
 elapsed time: 0.114418731 seconds (392000128 bytes allocated)
 @time XX = X.^2;
 elapsed time: 0.369641268 seconds (392000224 bytes allocated)

 I was surprised to see that Julia is about 3 times slower at calculating 
 a square than my original routine in Octave. I then read Performance tips 
 and found out that one should use * instead of raising to small integer 
 powers, for example x*x*x instead of x^3. I therefore tested the 
 following.
 @time XX = X.*X;
 elapsed time: 0.146059577 seconds (392000968 bytes allocated)

 This approach indeed resulted in a much shorter computing time. It is still, 
 however, a little slower than my code in Octave. Can someone advise on any 
 performance tips?

 I then finally do a sum over all columns of XX to get the self dot 
 product but first I'd like to fix the squaring part.

 Thanks a lot. 
 Best Regards,
 Jan 

 p.s. In the Julia manual I found, a while ago, an example of using the 
 @vectorize macro with a squaring function, but I cannot find it any more. 
 Perhaps the name of the macro was different ... 
   



Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-09 Thread Andreas Noack
The problem is that right now X[:,1001:end] makes a copy of the array.
However, in 0.4 this will instead be a view of the original matrix, and
therefore the computing time should be almost the same.

It might also be worth repeating Simon's comment that the floating point
power function has special handling of 2. The result is that

julia> @time A.^2;
elapsed time: 1.402791357 seconds (20256 bytes allocated, 5.90% gc time)

julia> @time A.^2.0;
elapsed time: 0.554241105 seconds (20256 bytes allocated, 15.04% gc time)

I tend to agree with Simon that special casing of integer 2 would be
reasonable.

Best regards,

Andreas Noack
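The three spellings whose timings are compared throughout this thread can be tried side by side (Julia 0.3-era behavior, per the numbers quoted above):

```julia
# The thread's comparison: same result, very different speeds in 0.3.
X = rand(7000, 7000)
@time X.^2     # integer exponent: generic power path, slowest here
@time X.^2.0   # float exponent: special-cased for 2.0, measured faster above
@time X.*X     # explicit elementwise multiply: fastest in the timings above
```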





Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-09 Thread Tim Holy
Is it really better to introduce VML as a temporary hack than it is to fix 
LLVM, even if the latter takes a little longer?

In Matlab, until quite recently X.^2 was slower than X.*X. It was that way for 
something like 25 years before they fixed it.

--Tim

On Monday, September 08, 2014 10:50:53 PM Jeff Waller wrote:
 I feel that x^2 must be among the things that are convenient and fast; it's
 one of those language elements that people simply depend on.  If it's
 inconvenient, then people are going to experience that over and over.  If
 it's not fast enough, people are going to experience slowness over and over.
 
 Like Simon says, it's already a thing:
 https://github.com/JuliaLang/julia/issues/2741.  Maybe use of something
 like VML is an option or necessary, or maybe extending inference.jl, or
 maybe it eventually might not even be necessary:
 http://llvm.org/docs/Vectorizers.html.
 
 Julia is a fundamental improvement, but give Matlab and Octave their due:
 that syntax is great.  When I say essentially C, to me it means
 
 Option A, use the built-in infix:
 
 X.^y
 
 versus Option B, go write a function:
 
 pow(x,y)
   for i = 1 ...
     for j = ...
   etc.
 
 Option A is just too good; it has to be supported.
 
 What to do?  Is this a fundamental flaw?  No, I don't think so.  Is this a
 one-time-only thing?  It feels like no; this is one of many things that
 will occasionally occur.  Is it possible to make this less of a hassle?  Like
 I think Tony is saying, can/should there be a branch of immediate
 optimizations?  Stuff that would eventually be done in a better way but
 needs to be completed now.  It's a branch, so things can be collected and
 retracted more coherently.
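A concrete version of the "Option B" loop sketched above might look like this (illustrative only; elpow is a made-up name, and this is what the X.^y infix spares you from writing):

```julia
# Hand-written elementwise power: the "Option B" that X.^y replaces.
function elpow(X::Matrix{Float64}, y)
    out = similar(X)
    for j = 1:size(X, 2), i = 1:size(X, 1)   # column-major traversal
        out[i, j] = X[i, j]^y
    end
    return out
end
```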



Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-09 Thread Patrick O'Leary
On Tuesday, September 9, 2014 8:02:44 AM UTC-5, Tim Holy wrote:

 In Matlab, until quite recently X.^2 was slower than X.*X. It was that way 
 for 
 something like 25 years before they fixed it. 


And a couple more years before it was fixed in Coder. And an expensive, 
very branchy set of guards was generated to wrap every invocation of pow(). 
We once had to change a whole bunch of these to X.*X to get a simulation 
down to realtime... 

Patrick


Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-09 Thread Ján Dolinský

OK, so basically there is nothing wrong with the syntax X[:,1001:end] ?

d = sumabs2(X[:,1001:end], 1);

and I should just wait until v0.4 is available (perhaps available soon 
in Julia Nightlies PPA).


I did the benchmark with the floating point power function based on 
Simon's comment. Here are my results (after a couple of repeated 
iterations):

@time X.^2;
elapsed time: 0.511988142 seconds (392000256 bytes allocated, 2.52% gc time)
@time X.^2.0;
elapsed time: 0.411791612 seconds (392000256 bytes allocated, 3.12% gc time)

Thanks,
Jan Dolinsky

On 09.09.2014 14:06, Andreas Noack wrote:
The problem is that right now X[:,1001:end] makes a copy of the array. 
However, in 0.4 this will instead be a view of the original matrix 
and therefore the computing time should be almost the same.


It might also be worth repeating Simon's comment that the floating 
point power function has special handling of 2. The result is that


julia> @time A.^2;
elapsed time: 1.402791357 seconds (20256 bytes allocated, 5.90% gc time)

julia> @time A.^2.0;
elapsed time: 0.554241105 seconds (20256 bytes allocated, 15.04% gc time)


I tend to agree with Simon that special casing of integer 2 would be 
reasonable.


Best regards

Andreas Noack

2014-09-09 4:24 GMT-04:00 Ján Dolinský jan.dolin...@2bridgz.com:


Hello guys,

Thanks a lot for the lengthy discussions. It helped me a lot to
get a feeling of what Julia is like. I did some more performance
comparisons as suggested by the first two posts (thanks a lot for the
tips). In the meantime I upgraded to v0.3.
X = rand(7000,7000);
@time d = sum(X.^2, 1);
elapsed time: 0.573125833 seconds (392056672 bytes allocated, 2.25% gc
time)
@time d = sum(X.*X, 1);
elapsed time: 0.178715901 seconds (392057080 bytes allocated, 14.06% gc
time)
@time d = sumabs2(X, 1);
elapsed time: 0.067431808 seconds (56496 bytes allocated)

In Octave then

X = rand(7000);
tic; d = sum(X.^2); toc;
Elapsed time is 0.167578 seconds.

So the ultimate solution is the sumabs2 function, which is a blast.
I am coming from Matlab/Octave and I would expect X.^2 to be fast
out of the box, but nevertheless if I can get excellent
performance by learning some new paradigms I will go for it.

The above tests lead me to another question. I often need to
calculate the self dot product over a portion of a matrix, e.g.
@time d = sumabs2(X[:,1001:end], 1);
elapsed time: 0.17566 seconds (336048688 bytes allocated, 7.01% gc
time)

Apparently this is not the way to do it in Julia, because working on
a smaller matrix of 7000x6000 gives more than double the computing
time and furthermore seems to allocate unnecessary memory.

Best Regards,
Jan



On Monday, 8 September 2014 at 10:36:02 UTC+2, Ján Dolinský
wrote:

Hello,

I am a new Julia user. I am trying to write a function for
computing self dot product of all columns in a matrix, i.e.
calculating a square of each element of a matrix and computing
a column-wise sum. I am interested in a proper way of doing it
because I often need to process large matrices.

I first put the focus on calculating the squares. For testing
purposes I use a matrix of random floats of size 7000x7000.
All timings here are taken after several repeated runs.

I used to do it in Octave (v3.8.1) as follows:

tic; X = rand(7000); toc;
Elapsed time is 0.579093 seconds.
tic; XX = X.^2; toc;
Elapsed time is 0.114737 seconds.


I tried to do the same in Julia (v0.2.1):

@time X = rand(7000,7000);
elapsed time: 0.114418731 seconds (392000128 bytes allocated)
@time XX = X.^2;
elapsed time: 0.369641268 seconds (392000224 bytes allocated)

I was surprised to see that Julia is about 3 times slower when
calculating a square than my original routine in Octave. I
then read Performance tips and found out that one should use
* instead of raising to small integer powers, for example
x*x*x instead of x^3. I therefore tested the following.

@time XX = X.*X;
elapsed time: 0.146059577 seconds (392000968 bytes allocated)

This approach indeed resulted in a much shorter computing time.
It is still, however, a little slower than my code in Octave.
Can someone advise on any performance tips?

I then finally do a sum over all columns of XX to get the
self dot product, but first I'd like to fix the squaring part.

Thanks a lot.
Best Regards,
Jan

p.s. In the Julia manual I found a while ago an example of using the
@vectorize macro with a squaring function but cannot find it
any more. Perhaps the name of the macro was different ...





Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-09 Thread Andreas Noack
If you need the speed now you can try one of the packages ArrayViews or
ArrayViewsAPL. It is functionality similar to these packages that we are
trying to include in Base.
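For example, taking the column-subset sum without a copy might look like this with the ArrayViews package (a sketch against its 2014-era API; that sumabs2 accepts the resulting view efficiently is an assumption here):

```julia
using ArrayViews  # Pkg.add("ArrayViews") first

X = rand(7000, 7000)

# `view` creates a lightweight window into X; no 7000x6000 copy is made.
Xv = view(X, :, 1001:size(X, 2))

# Squared column norms of only the viewed columns.
d = sumabs2(Xv, 1)
```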

Best regards

Andreas Noack

2014-09-09 9:38 GMT-04:00 Ján Dolinský jan.dolin...@2bridgz.com:

  OK, so basically there is nothing wrong with the syntax X[:,1001:end] ?

  d = sumabs2(X[:,1001:end], 1);
  and I should just wait until v0.4 is available (perhaps available soon
 in Julia Nightlies PPA).

 I did the benchmark with the floating point power function based on
 Simon's comment. Here are my results (after a couple of repeated
 iterations):
  @time X.^2;
 elapsed time: 0.511988142 seconds (392000256 bytes allocated, 2.52% gc
 time)
 @time X.^2.0;
 elapsed time: 0.411791612 seconds (392000256 bytes allocated, 3.12% gc
 time)

 Thanks,
 Jan Dolinsky

  On 09.09.2014 14:06, Andreas Noack wrote:

 The problem is that right now X[:,1001:end] makes a copy of the array.
 However, in 0.4 this will instead be a view of the original matrix and
 therefore the computing time should be almost the same.

  It might also be worth repeating Simon's comment that the floating point
 power function has special handling of 2. The result is that

 julia> @time A.^2;
 elapsed time: 1.402791357 seconds (20256 bytes allocated, 5.90% gc
 time)

 julia> @time A.^2.0;
 elapsed time: 0.554241105 seconds (20256 bytes allocated, 15.04% gc
 time)

  I tend to agree with Simon that special casing of integer 2 would be
 reasonable.

 Best regards

 Andreas Noack

 2014-09-09 4:24 GMT-04:00 Ján Dolinský jan.dolin...@2bridgz.com:

 Hello guys,

 Thanks a lot for the lengthy discussions. It helped me a lot to get a
 feeling of what Julia is like. I did some more performance comparisons as
 suggested by the first two posts (thanks a lot for the tips). In the meantime
 I upgraded to v0.3.
  X = rand(7000,7000);
 @time d = sum(X.^2, 1);
 elapsed time: 0.573125833 seconds (392056672 bytes allocated, 2.25% gc
 time)
 @time d = sum(X.*X, 1);
 elapsed time: 0.178715901 seconds (392057080 bytes allocated, 14.06% gc
 time)
 @time d = sumabs2(X, 1);
 elapsed time: 0.067431808 seconds (56496 bytes allocated)

 In Octave then
  X = rand(7000);
 tic; d = sum(X.^2); toc;
 Elapsed time is 0.167578 seconds.

 So the ultimate solution is the sumabs2 function, which is a blast. I am
 coming from Matlab/Octave and I would expect X.^2 to be fast out of the
 box, but nevertheless if I can get excellent performance by learning
 some new paradigms I will go for it.

 The above tests lead me to another question. I often need to calculate
 the self dot product over a portion of a matrix, e.g.
  @time d = sumabs2(X[:,1001:end], 1);
 elapsed time: 0.17566 seconds (336048688 bytes allocated, 7.01% gc
 time)

 Apparently this is not the way to do it in Julia, because working on a
 smaller matrix of 7000x6000 gives more than double the computing time and
 furthermore seems to allocate unnecessary memory.

 Best Regards,
 Jan



 On Monday, 8 September 2014 at 10:36:02 UTC+2, Ján Dolinský wrote:

 Hello,

 I am a new Julia user. I am trying to write a function for computing
 self dot product of all columns in a matrix, i.e. calculating a square of
 each element of a matrix and computing a column-wise sum. I am interested
 in a proper way of doing it because I often need to process large matrices.

 I first put the focus on calculating the squares. For testing purposes I
 use a matrix of random floats of size 7000x7000. All timings here are
 taken after several repeated runs.

 I used to do it in Octave (v3.8.1) as follows:
  tic; X = rand(7000); toc;
 Elapsed time is 0.579093 seconds.
 tic; XX = X.^2; toc;
 Elapsed time is 0.114737 seconds.


 I tried to do the same in Julia (v0.2.1):
  @time X = rand(7000,7000);
 elapsed time: 0.114418731 seconds (392000128 bytes allocated)
 @time XX = X.^2;
 elapsed time: 0.369641268 seconds (392000224 bytes allocated)

 I was surprised to see that Julia is about 3 times slower when
 calculating a square than my original routine in Octave. I then read
 Performance tips and found out that one should use * instead of
 raising to small integer powers, for example x*x*x instead of x^3. I
 therefore tested the following.
  @time XX = X.*X;
 elapsed time: 0.146059577 seconds (392000968 bytes allocated)

 This approach indeed resulted in a much shorter computing time. It is
 still, however, a little slower than my code in Octave. Can someone advise on
 any performance tips?

 I then finally do a sum over all columns of XX to get the self dot
 product, but first I'd like to fix the squaring part.

 Thanks a lot.
 Best Regards,
 Jan

 p.s. In the Julia manual I found a while ago an example of using the @vectorize
 macro with a squaring function but cannot find it any more. Perhaps the
 name of the macro was different ...







Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-09 Thread Ján Dolinský

Hello Andreas,

Thanks for the tip. I'll check it out. Thumbs up for the 0.4!

Jan

On 09.09.2014 17:04, Andreas Noack wrote:
If you need the speed now you can try one of the package ArrayViews or 
ArrayViewsAPL. It is something similar to the functionality in these 
packages that we are trying to include in base.


Best regards

Andreas Noack

2014-09-09 9:38 GMT-04:00 Ján Dolinský jan.dolin...@2bridgz.com:


OK, so basically there is nothing wrong with the syntax
X[:,1001:end] ?

d = sumabs2(X[:,1001:end], 1);

and I should just wait until v0.4 is available (perhaps
available soon in Julia Nightlies PPA).

I did the benchmark with the floating point power function based
on Simon's comment. Here are my results (after a couple of
repeated iterations):

@time X.^2;
elapsed time: 0.511988142 seconds (392000256 bytes allocated,
2.52% gc time)
@time X.^2.0;
elapsed time: 0.411791612 seconds (392000256 bytes allocated,
3.12% gc time)

Thanks,
Jan Dolinsky

On 09.09.2014 14:06, Andreas Noack wrote:

The problem is that right now X[:,1001:end] makes a copy of the
array. However, in 0.4 this will instead be a view of the
original matrix and therefore the computing time should be almost
the same.

It might also be worth repeating Simon's comment that the
floating point power function has special handling of 2. The
result is that

julia> @time A.^2;
elapsed time: 1.402791357 seconds (20256 bytes allocated,
5.90% gc time)

julia> @time A.^2.0;
elapsed time: 0.554241105 seconds (20256 bytes allocated,
15.04% gc time)

I tend to agree with Simon that special casing of integer 2 would
be reasonable.

Best regards

Andreas Noack

2014-09-09 4:24 GMT-04:00 Ján Dolinský
jan.dolin...@2bridgz.com:

Hello guys,

Thanks a lot for the lengthy discussions. It helped me a lot
to get a feeling of what Julia is like. I did some more
performance comparisons as suggested by the first two posts
(thanks a lot for the tips). In the meantime I upgraded to v0.3.
X = rand(7000,7000);
@time d = sum(X.^2, 1);
elapsed time: 0.573125833 seconds (392056672 bytes
allocated, 2.25% gc time)
@time d = sum(X.*X, 1);
elapsed time: 0.178715901 seconds (392057080 bytes
allocated, 14.06% gc time)
@time d = sumabs2(X, 1);
elapsed time: 0.067431808 seconds (56496 bytes allocated)

In Octave then
X = rand(7000);
tic; d = sum(X.^2); toc;
Elapsed time is 0.167578 seconds.

So the ultimate solution is the sumabs2 function, which is a
blast. I am coming from Matlab/Octave and I would expect
X.^2 to be fast out of the box, but nevertheless if I can
get excellent performance by learning some new paradigms I
will go for it.

The above tests lead me to another question. I often need to
calculate the self dot product over a portion of a matrix, e.g.
@time d = sumabs2(X[:,1001:end], 1);
elapsed time: 0.17566 seconds (336048688 bytes
allocated, 7.01% gc time)

Apparently this is not the way to do it in Julia, because
working on a smaller matrix of 7000x6000 gives more than
double the computing time and furthermore seems to allocate
unnecessary memory.

Best Regards,
Jan



On Monday, 8 September 2014 at 10:36:02 UTC+2, Ján Dolinský
wrote:

Hello,

I am a new Julia user. I am trying to write a function
for computing self dot product of all columns in a
matrix, i.e. calculating a square of each element of a
matrix and computing a column-wise sum. I am interested
in a proper way of doing it because I often need to
process large matrices.

I first put the focus on calculating the squares. For
testing purposes I use a matrix of random floats of size
7000x7000. All timings here are taken after several
repeated runs.

I used to do it in Octave (v3.8.1) as follows:

tic; X = rand(7000); toc;
Elapsed time is 0.579093 seconds.
tic; XX = X.^2; toc;
Elapsed time is 0.114737 seconds.


I tried to do the same in Julia (v0.2.1):

@time X = rand(7000,7000);
elapsed time: 0.114418731 seconds (392000128 bytes allocated)
@time XX = X.^2;
elapsed time: 0.369641268 seconds (392000224 bytes allocated)

I was surprised to see that Julia is about 3 times slower
when calculating a square than my original routine in
 

[julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-08 Thread Tony Kelman
I recommend you read through this blog post: 
http://julialang.org/blog/2013/09/fast-numeric/
And maybe 
http://www.johnmyleswhite.com/notebook/2013/12/22/the-relationship-between-vectorized-and-devectorized-code/
too while you're at it; just replace R with Octave or Matlab and the 
conclusion is basically the same.

The macro you were looking for was probably @devec from the Devectorize.jl 
package.

If you don't actually need the intermediate matrix full of the 49 million 
squared values, it's a huge waste of resources to allocate memory to store 
them all. If you upgrade to Julia 0.3.0, there's a function sumabs2(x, dim) 
that computes sum(abs(x).^2, dim) without allocating intermediate temporary 
arrays for abs(x).^2. On my laptop sumabs2(X, 1) is about 6 times faster 
and allocates hardly any memory at all relative to sum(X.*X, 1) which needs 
to allocate almost 400 MB.
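A quick way to compare the two on your own machine (Julia 0.3; the timings will of course vary by machine):

```julia
X = rand(7000, 7000)

# Allocates a full 7000x7000 temporary (~400 MB) before reducing:
@time d1 = sum(X.*X, 1)

# Fuses the squaring and the column-wise reduction; allocates almost nothing:
@time d2 = sumabs2(X, 1)

# Both compute the squared column norms:
maximum(abs(d1 - d2)) < 1e-8
```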


On Monday, September 8, 2014 1:36:02 AM UTC-7, Ján Dolinský wrote:

 Hello,

 I am a new Julia user. I am trying to write a function for computing 
 self dot product of all columns in a matrix, i.e. calculating a square of 
 each element of a matrix and computing a column-wise sum. I am interested 
 in a proper way of doing it because I often need to process large matrices.

 I first put the focus on calculating the squares. For testing purposes I use 
 a matrix of random floats of size 7000x7000. All timings here are taken 
 after several repeated runs.

 I used to do it in Octave (v3.8.1) as follows:
 tic; X = rand(7000); toc;
 Elapsed time is 0.579093 seconds.
 tic; XX = X.^2; toc;
 Elapsed time is 0.114737 seconds.


 I tried to do the same in Julia (v0.2.1):
 @time X = rand(7000,7000);
 elapsed time: 0.114418731 seconds (392000128 bytes allocated)
 @time XX = X.^2;
 elapsed time: 0.369641268 seconds (392000224 bytes allocated)

 I was surprised to see that Julia is about 3 times slower when calculating 
 a square than my original routine in Octave. I then read Performance tips 
 and found out that one should use * instead of raising to small integer 
 powers, for example x*x*x instead of x^3. I therefore tested the 
 following.
 @time XX = X.*X;
 elapsed time: 0.146059577 seconds (392000968 bytes allocated)

 This approach indeed resulted in a much shorter computing time. It is still, 
 however, a little slower than my code in Octave. Can someone advise on any 
 performance tips?

 I then finally do a sum over all columns of XX to get the self dot 
 product, but first I'd like to fix the squaring part.

 Thanks a lot. 
 Best Regards,
 Jan 

 p.s. In the Julia manual I found a while ago an example of using the @vectorize 
 macro with a squaring function but cannot find it any more. Perhaps the 
 name of the macro was different ... 
   



[julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-08 Thread Mohammed El-Beltagy
Octave seems to be doing a pretty good job for this type of calculation. If 
you want to squeeze a bit more performance out of Julia, you can try to use 
explicit loops (devectorize as in 
http://julialang.org/blog/2013/09/fast-numeric/). Then you might also 
remove bounds checking in a loop for faster performance. 

Try this function:
function doCalc(x::Array{Float64,2})
    XX = Array(Float64, 7000, 7000)
    for j = 1:7000, i = 1:7000
        @inbounds XX[i,j] = x[i,j]*x[i,j]
    end
    XX
end

Followed by 
@time XX=doCalc(X);



On Monday, September 8, 2014 11:36:02 AM UTC+3, Ján Dolinský wrote:

 Hello,

 I am a new Julia user. I am trying to write a function for computing 
 self dot product of all columns in a matrix, i.e. calculating a square of 
 each element of a matrix and computing a column-wise sum. I am interested 
 in a proper way of doing it because I often need to process large matrices.

 I first put the focus on calculating the squares. For testing purposes I use 
 a matrix of random floats of size 7000x7000. All timings here are taken 
 after several repeated runs.

 I used to do it in Octave (v3.8.1) as follows:
 tic; X = rand(7000); toc;
 Elapsed time is 0.579093 seconds.
 tic; XX = X.^2; toc;
 Elapsed time is 0.114737 seconds.


 I tried to do the same in Julia (v0.2.1):
 @time X = rand(7000,7000);
 elapsed time: 0.114418731 seconds (392000128 bytes allocated)
 @time XX = X.^2;
 elapsed time: 0.369641268 seconds (392000224 bytes allocated)

 I was surprised to see that Julia is about 3 times slower when calculating 
 a square than my original routine in Octave. I then read Performance tips 
 and found out that one should use * instead of raising to small integer 
 powers, for example x*x*x instead of x^3. I therefore tested the 
 following.
 @time XX = X.*X;
 elapsed time: 0.146059577 seconds (392000968 bytes allocated)

 This approach indeed resulted in a much shorter computing time. It is still, 
 however, a little slower than my code in Octave. Can someone advise on any 
 performance tips?

 I then finally do a sum over all columns of XX to get the self dot 
 product, but first I'd like to fix the squaring part.

 Thanks a lot. 
 Best Regards,
 Jan 

 p.s. In the Julia manual I found a while ago an example of using the @vectorize 
 macro with a squaring function but cannot find it any more. Perhaps the 
 name of the macro was different ... 
   



[julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-08 Thread Jeff Waller


On Monday, September 8, 2014 7:37:52 AM UTC-4, Mohammed El-Beltagy wrote:

 Octave seems to be doing a pretty good job for this type of calculation. 
 If you want to squeeze a bit more performance out of Julia, you can try to 
 use explicit loops (devectorize as in 
 http://julialang.org/blog/2013/09/fast-numeric/). Then you might also 
 remove bounds checking in a loop for faster performance. 

 Try this function:
 function doCalc(x::Array{Float64,2})
 XX=Array(Float64, 7000,7000)
 for j=1:7000,i=1:7000
 @inbounds XX[i,j]=x[i,j]*x[i,j]
 end
 XX
 end

 Followed by 
 @time XX=doCalc(X);



I am totally not cool with this. Simply making things fast at whatever 
the cost is not good enough. You can't tell someone Oh sure, Julia 
does support that extremely convenient syntax, but because it's 6 times 
slower (and it is), you need to essentially write C.  There's no reason 
Julia should be any slower than Octave for this, except for bugs 
which will eventually be fixed.  I think it's perfectly ok to say 
devectorize if you want to do even better, but it must be at least equal.

Here's the implementation of .^:

.^(x::StridedArray, y::Number) = reshape([ x[i]^y for i=1:length(x) ], size(x))

How is that made less vectorized?

Yes, R, Matlab, and Octave do conflate vectorization for speed and style, and 
that's a mistake.  Julia tries not to do that; that's better.  But do you 
expect people to throw away the Matlab syntax?  No way.

for me:

julia> x = rand(7000,7000);

julia> @time x.^2;

elapsed time: 1.340131226 seconds (392000256 bytes allocated)

And in Octave

>> x=rand(7000);

>> tic;y=x.^2;toc

Elapsed time is 0.201613 seconds.

Comprehensions might be the culprit, or simply use of the generic x^y. 
Here's an alternate implementation I threw together:

julia> function .^{T}(x::Array{T}, y::Number)
           z = *(size(x)...)
           new_x = Array(T, z)
           tmp_x = reshape(x, z)
           for i in 1:z
               @inbounds if y == 2
                   new_x[i] = tmp_x[i]*tmp_x[i]
               else
                   new_x[i] = tmp_x[i]^y
               end
           end
           reshape(new_x, size(x))
       end
.^ (generic function with 24 methods)

julia> @time x.^2;

elapsed time: 0.248816016 seconds (392000360 bytes allocated, 4.73% gc time)

Close




[julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-08 Thread Tony Kelman
Calm down. It's very easy to match Octave here, by writing the standard 
library in C. Julia chose not to go that route. Of course everyone would 
like the concise vectorized syntax to be as fast as the for-loop syntax. If 
it were easy to do that *without* writing the standard library in C, it 
would've been done already. In the meantime it isn't easy and hasn't been 
done yet, so when someone asks how to write code that's faster than Octave, 
this is the answer.

There have been various LLVM bugs regarding automatic optimizations of 
converting small integer powers to multiplications. if y == 2 code really 
shouldn't be necessary here, LLVM is smarter than that but sometimes fickle 
and buggy under the demands Julia puts on it.

And in this case the answer is clearly a reduction for which Julia has a 
very good set of tools to express efficiently. If your output is much less 
data than your input, who cares how fast you can calculate temporaries that 
you don't need to save?
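In Julia 0.3 terms, the reduction tools Tony mentions can be used directly, without ever materializing the temporaries:

```julia
X = rand(1000, 1000)

# Column-wise squared norms, no X.^2 temporary:
d = sumabs2(X, 1)

# Whole-array sum of squares, also without a temporary:
total = mapreduce(abs2, +, X)
```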


On Monday, September 8, 2014 9:18:02 AM UTC-7, Jeff Waller wrote:



 On Monday, September 8, 2014 7:37:52 AM UTC-4, Mohammed El-Beltagy wrote:

 Octave seems to be doing a pretty good job for this type of calculation. 
 If you want to squeeze a bit more performance out of Julia, you can try to 
 use explicit loops (devectorize as in 
 http://julialang.org/blog/2013/09/fast-numeric/). Then you might also 
 remove bounds checking in a loop for faster performance. 

 Try this function:
 function doCalc(x::Array{Float64,2})
 XX=Array(Float64, 7000,7000)
 for j=1:7000,i=1:7000
 @inbounds XX[i,j]=x[i,j]*x[i,j]
 end
 XX
 end

 Followed by 
 @time XX=doCalc(X);



 I am totally not cool with this. Just simply making things fast at 
 whatever the cost is not good enough. You can't tell someone that Oh sure, 
 Julia does support that extremely convenient syntax, but because it's 6 
 times slower (and it is), you need to essentially write C.  There's no 
 reason that Julia for this should be any less fast than Octave except for 
 bugs which will eventually be fixed.  I think it's perfectly ok to say 
 devectorize if you want to do even better, but it must be at least equal.

 Here's the implementation of .^:

 .^(x::StridedArray, y::Number) = reshape([ x[i]^y for i=1:length(x) ], 
 size(x))
 How is that made less vectorized?

 Yes, R, Matlab, and Octave do conflate vectorization for speed and style, and 
 that's a mistake.  Julia tries not to do that; that's better.  But do you 
 expect people to throw away the Matlab syntax?  No way.

 for me:

 julia> x = rand(7000,7000);

 julia> @time x.^2;

 elapsed time: 1.340131226 seconds (392000256 bytes allocated)

 And in Octave

 >> x=rand(7000);

 >> tic;y=x.^2;toc

 Elapsed time is 0.201613 seconds.

 Comprehensions might be the culprit, or simply use of the generic x^y. 
 Here's an alternate implementation I threw together:

 julia> function .^{T}(x::Array{T}, y::Number)
            z = *(size(x)...)
            new_x = Array(T, z)
            tmp_x = reshape(x, z)
            for i in 1:z
                @inbounds if y == 2
                    new_x[i] = tmp_x[i]*tmp_x[i]
                else
                    new_x[i] = tmp_x[i]^y
                end
            end
            reshape(new_x, size(x))
        end
 .^ (generic function with 24 methods)

 julia> @time x.^2;

 elapsed time: 0.248816016 seconds (392000360 bytes allocated, 4.73% gc 
 time)

 Close




[julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-08 Thread Douglas Bates
On Monday, September 8, 2014 11:18:02 AM UTC-5, Jeff Waller wrote:



 On Monday, September 8, 2014 7:37:52 AM UTC-4, Mohammed El-Beltagy wrote:

 Octave seems to be doing a pretty good job for this type of calculation. 
 If you want to squeeze a bit more performance out of Julia, you can try to 
 use explicit loops (devectorize as in 
 http://julialang.org/blog/2013/09/fast-numeric/). Then you might also 
 remove bounds checking in a loop for faster performance. 

 Try this function:
 function doCalc(x::Array{Float64,2})
 XX=Array(Float64, 7000,7000)
 for j=1:7000,i=1:7000
 @inbounds XX[i,j]=x[i,j]*x[i,j]
 end
 XX
 end

 Followed by 
 @time XX=doCalc(X);



 I am totally not cool with this. Just simply making things fast at 
 whatever the cost is not good enough. You can't tell someone that Oh sure, 
 Julia does support that extremely convenient syntax, but because it's 6 
 times slower (and it is), you need to essentially write C.  There's no 
 reason that Julia for this should be any less fast than Octave except for 
 bugs which will eventually be fixed.  I think it's perfectly ok to say 
 devectorize if you want to do even better, but it must be at least equal.

 Here's the implementation of .^:

 .^(x::StridedArray, y::Number) = reshape([ x[i]^y for i=1:length(x) ], 
 size(x))
 How is that made less vectorized?


But the original question was how to evaluate the squared lengths of the 
columns of a large matrix, for which one answer is

sumabs2(X,1)

The reason the discussion switched to evaluating componentwise powers 
of matrices is that this is the natural way to approach the problem in R, 
Matlab, Octave, ..., where running a loop, which is what goes on behind 
sumabs2, is expensive.

Julia is different from R, Matlab, Octave, ...  To use it effectively you 
need to learn new idioms.
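Conceptually, sumabs2(X, 1) amounts to a loop like the following (a simplified sketch with a hypothetical name, not the actual Base implementation, which is more careful about accuracy and speed):

```julia
function col_sumabs2(X::Matrix{Float64})
    m, n = size(X)
    d = zeros(1, n)
    # Accumulate each column's sum of squares in place; no temporary matrix.
    for j = 1:n, i = 1:m
        @inbounds d[j] += X[i,j]*X[i,j]
    end
    d
end
```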


 Yes R, Matlab, Octave do conflate vectorization for speed and style and 
 that's a mistake.  Julia tries not to do that that's better.  But do you 
 expect people to throw away the Matlab syntax?  No way.

 for me:

 julia> x = rand(7000,7000);

 julia> @time x.^2;

 elapsed time: 1.340131226 seconds (392000256 bytes allocated)

 And in Octave

 >> x=rand(7000);

 >> tic;y=x.^2;toc

 Elapsed time is 0.201613 seconds.

 Comprehensions might be the culprit, or simply use of the generic x^y. 
 Here's an alternate implementation I threw together:

 julia> function .^{T}(x::Array{T}, y::Number)
            z = *(size(x)...)
            new_x = Array(T, z)
            tmp_x = reshape(x, z)
            for i in 1:z
                @inbounds if y == 2
                    new_x[i] = tmp_x[i]*tmp_x[i]
                else
                    new_x[i] = tmp_x[i]^y
                end
            end
            reshape(new_x, size(x))
        end
 .^ (generic function with 24 methods)

 julia> @time x.^2;

 elapsed time: 0.248816016 seconds (392000360 bytes allocated, 4.73% gc 
 time)

 Close




[julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-08 Thread Jeff Waller


On Monday, September 8, 2014 12:37:16 PM UTC-4, Tony Kelman wrote:

 Calm down. It's very easy to match Octave here, by writing the standard 
 library in C. Julia chose not to go that route. Of course everyone would 
 like the concise vectorized syntax to be as fast as the for-loop syntax. If 
 it were easy to do that *without* writing the standard library in C, it 
 would've been done already. In the meantime it isn't easy and hasn't been 
 done yet, so when someone asks how to write code that's faster than Octave, 
 this is the answer.


Haha, why do I come across as combative??? HUH?  Oh wait, that's kind of 
combative.  But wait, from above I think it's been illustrated that a 
C-based rewrite is not necessary, but the current form is no good.  


 There have been various LLVM bugs regarding automatic optimizations of 
 converting small integer powers to multiplications. if y == 2 code really 
 shouldn't be necessary here, LLVM is smarter than that but sometimes fickle 
 and buggy under the demands Julia puts on it.


Apparently.  For some additional info, check this out: if you change the 
Octave example to this:

>> tic;y=x.^2.1;toc
Elapsed time is 1.80085 seconds.

Now it's slower than Julia. Like you said, you can't trust LLVM on 
everything, so the question is what to do about it?
 


 And in this case the answer is clearly a reduction for which Julia has a 
 very good set of tools to express efficiently. If your output is much less 
 data than your input, who cares how fast you can calculate temporaries that 
 you don't need to save?


I think this is missing the point and concentrating on optimizing the wrong 
thing, or possibly I'm missing your point.  Here's my point:

Everyone coming from Octave/Matlab will expect, and should expect, that X.^2 
is usable and just as fast.



[julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-08 Thread Simon Kornblith
Actually, if you want this to be fast I don't think you can avoid the if y 
== 2 branch, although ideally it should be outside the loop and not inside 
it. LLVM will optimize x^2, but if it doesn't know what the y in x^y is at 
compile time, it will just call libm. It's not going to emit a branch to 
handle the case where y == 2, since it has no idea what arguments the 
function will actually be called with and it doesn't want to slow things 
down if it's not called with y == 2. It just so happens that X.^2 is pretty 
common in real life.
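Hoisting the y == 2 branch out of the loop, as suggested, might look like this (a hypothetical helper, not the actual Base code):

```julia
function pow_special(x::Array{Float64}, y::Number)
    out = similar(x)
    if y == 2
        # The branch is decided once; the hot loop is pure multiplies.
        for i = 1:length(x)
            @inbounds out[i] = x[i]*x[i]
        end
    else
        for i = 1:length(x)
            @inbounds out[i] = x[i]^y
        end
    end
    out
end
```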

In the Octave source I see that Octave special-cases X.^2, X.^3, and 
X.^(-1). (Notice how X.^4 is ~6x slower.) We could do the same, or inline 
the loop so that LLVM's optimizers can get a crack at it in the constant 
power case, or dynamically generate a specialized loop that uses multiply 
operations for arbitrary integer powers (but see here 
https://github.com/JuliaLang/julia/issues/2741#issuecomment-47008166 for 
a caveat: this will be faster than calling libm, but it will be less 
accurate). But if you want a fast version of X.^2, right now you can just 
call abs2(X).

Tony is right that sumabs2(X, 1) will be faster, both versus sum(abs2(X), 1) 
and versus anything you could achieve with Octave, since it avoids 
allocating a temporary array and exploits BLAS. (On my machine it's about 
9x faster than sum(X.^2, 1) in Octave.) But on my machine sum(abs2(X), 1) 
in Julia is only a few percent slower than sum(X.^2, 1) in Octave if that's 
all you want.
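Under the versions discussed here, the pecking order would be roughly (abs2 is applied elementwise by Julia 0.3's vectorized math functions; timings are machine-dependent):

```julia
X = rand(7000, 7000)

d1 = sum(X.^2, 1)      # slowest: generic pow plus a temporary array
d2 = sum(abs2(X), 1)   # faster: elementwise square, still one temporary
d3 = sumabs2(X, 1)     # fastest: fused square + reduction, no temporary
```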

Simon

On Monday, September 8, 2014 12:37:16 PM UTC-4, Tony Kelman wrote:

 Calm down. It's very easy to match Octave here, by writing the standard 
 library in C. Julia chose not to go that route. Of course everyone would 
 like the concise vectorized syntax to be as fast as the for-loop syntax. If 
 it were easy to do that *without* writing the standard library in C, it 
 would've been done already. In the meantime it isn't easy and hasn't been 
 done yet, so when someone asks how to write code that's faster than Octave, 
 this is the answer.

 There have been various LLVM bugs regarding automatic optimizations of 
 converting small integer powers to multiplications. if y == 2 code really 
 shouldn't be necessary here, LLVM is smarter than that but sometimes fickle 
 and buggy under the demands Julia puts on it.

 And in this case the answer is clearly a reduction for which Julia has a 
 very good set of tools to express efficiently. If your output is much less 
 data than your input, who cares how fast you can calculate temporaries that 
 you don't need to save?


 On Monday, September 8, 2014 9:18:02 AM UTC-7, Jeff Waller wrote:



 On Monday, September 8, 2014 7:37:52 AM UTC-4, Mohammed El-Beltagy wrote:

 Octave seems to be doing a pretty good job for this type of calculation. 
 If you want to squeeze a bit more performance out of Julia, you can try to 
 use explicit loops (devectorize as in 
 http://julialang.org/blog/2013/09/fast-numeric/). Then you might also 
 remove bounds checking in a loop for faster performance. 

 Try this function:
 function doCalc(x::Array{Float64,2})
     XX = Array(Float64, 7000, 7000)
     for j = 1:7000, i = 1:7000
         @inbounds XX[i,j] = x[i,j] * x[i,j]
     end
     XX
 end

 Followed by 
 @time XX=doCalc(X);



 I am totally not cool with this. Simply making things fast at whatever the 
 cost is not good enough. You can't tell someone, "Oh sure, Julia does 
 support that extremely convenient syntax, but because it's 6 times slower 
 (and it is), you need to essentially write C." There's no reason that Julia 
 should be any less fast than Octave for this, except for bugs which will 
 eventually be fixed. I think it's perfectly ok to say devectorize if you 
 want to do even better, but it must be at least equal.

 Here's the implementation of .^ :

 .^(x::StridedArray, y::Number) =
     reshape([ x[i] ^ y for i=1:length(x) ], size(x))

 How is that any less vectorized?

 Yes, R, Matlab, and Octave do conflate vectorization for speed with 
 vectorization for style, and that's a mistake; Julia tries not to, and 
 that's better. But do you expect people to throw away the Matlab syntax? 
 No way.

 for me:

 julia> x = rand(7000,7000);

 julia> @time x.^2;

 elapsed time: 1.340131226 seconds (392000256 bytes allocated)

 And in Octave

 >> x = rand(7000);

 >> tic; y = x.^2; toc

 Elapsed time is 0.201613 seconds.

 Comprehensions might be the culprit, or simply use of the generic x^y 
  here's an alternate implementation I threw together

 julia> function .^{T}(x::Array{T}, y::Number)
            z = *(size(x)...)
            new_x = Array(T, z)
            tmp_x = reshape(x, z)
            for i in 1:z
                @inbounds if y == 2
                    new_x[i] = tmp_x[i]*tmp_x[i]
                else
                    new_x[i] = tmp_x[i]^y
                end
            end
            reshape(new_x, size(x))
        end

 .^ (generic 

[julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-08 Thread Tony Kelman
 http://www.spaceflightnow.com/news/n1408/27slscommitment/#.VA3nNPmwIpA

Identify actionable improvements. Maybe adding a branch somewhere and 
specializing on the most common powers is an acceptable solution. The best 
answer is to fix the LLVM bugs; some of this may go away when we switch to 
LLVM 3.5, which was just released, but there's an open issue and several 
other problems, so it won't be a simple version bump.

 Everyone coming from Octave/Matlab will expect and should expect that 
 X.^2 is usable and just as fast.

Sure. Part of my point is that because this is the only fast way to do things 
in Octave/Matlab, you're led to want to do things this way even in situations 
where it's not necessary or desirable. Do you actually need to make a new 
separate copy of the entire array? If not, you could square the elements 
in-place to save on allocation. Do you really need the squared value of 
every single element? If not, a reduction is a much faster way to get the 
information you're looking for.

Expectations based on the way Octave/Matlab work are really not applicable 
to a system that works completely differently.



[julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-08 Thread Tony Kelman
[whoops, mis-pasted this]

 so the question is what to do about it?

Identify actionable improvements. Maybe adding a branch somewhere and 
specializing on the most common powers is an acceptable solution. The best 
answer is to fix the LLVM bugs; some of this may go away when we switch to 
LLVM 3.5, which was just released, but there's an open issue and several 
other problems, so it won't be a simple version bump.

 Everyone coming from Octave/Matlab will expect and should expect that 
 X.^2 is usable and just as fast.

Sure. Part of my point is that because this is the only fast way to do things 
in Octave/Matlab, you're led to want to do things this way even in situations 
where it's not necessary or desirable. Do you actually need to make a new 
separate copy of the entire array? If not, you could square the elements 
in-place to save on allocation. Do you really need the squared value of 
every single element? If not, a reduction is a much faster way to get the 
information you're looking for.

Expectations based on the way Octave/Matlab work are really not applicable 
to a system that works completely differently.
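Both alternatives Tony describes can be sketched like this (a hedged illustration in later Julia syntax, not the 2014-era API):

```julia
X = rand(100, 100)

# In-place squaring: reuses the existing storage instead of allocating a
# second full-size array.
Y = copy(X)
map!(abs2, Y, Y)
@assert Y == X .^ 2

# Reduction: a 1x100 result instead of a 100x100 elementwise temporary.
s = sum(abs2, X, dims=1)
@assert size(s) == (1, 100)
```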



Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-08 Thread Stefan Karpinski
On Mon, Sep 8, 2014 at 7:48 PM, Tony Kelman t...@kelman.net wrote:


 Expectations based on the way Octave/Matlab work are really not applicable
 to a system that works completely differently.


That said, we do want to make this faster – but we don't want to cheat to
do it, and special casing small powers is kind of cheating. In this case it
may be worth it, but what we'd really like is a more general kind of
optimization here that has the same effect in this particular case.


Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-08 Thread Simon Kornblith
I don't think there's anything wrong with specializing small powers. We 
already specialize small powers in openlibm's pow and even in inference.jl. 
If you want to compute x^2 at maximum speed, something, somewhere needs to 
know that when the power is 2 it should just call mulsd instead of doing 
whatever magic it does for arbitrary powers. Translating integer exponents 
to multiplies only kind of works for powers higher than 3 anyway; it will 
be faster but not as accurate as calling libm.
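For illustration only (this is not Base's actual code, and the helper name is hypothetical), the kind of specialization being discussed is just a branch on the most common powers:

```julia
# Hypothetical helper: dispatch x^2 to a single multiply, x^3 to two,
# and fall back to the general power path otherwise.
function pow_small(x::Float64, y::Int)
    y == 2 && return x * x
    y == 3 && return x * x * x
    return x ^ y
end

@assert pow_small(3.0, 2) == 9.0
@assert pow_small(2.0, 3) == 8.0
```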

Another option would be to use a vector math library, which would make 
vectorized versions of everything faster. As far as I'm aware VML is still 
the only option that claims 1 ulp accuracy, which does indeed provide a 
substantial 
performance boost https://github.com/simonster/VML.jl#performance (and 
also seems to special case x^2). I don't think we could actually use it in 
Base, though; even if the VML license allows it, I don't think we could 
link against Rmath, UMFPACK, and whatever other GPL libraries if we did.

Simon

On Monday, September 8, 2014 4:02:34 PM UTC-4, Stefan Karpinski wrote:

 On Mon, Sep 8, 2014 at 7:48 PM, Tony Kelman to...@kelman.net wrote:


 Expectations based on the way Octave/Matlab work are really not 
 applicable to a system that works completely differently.


 That said, we do want to make this faster – but we don't want to cheat to 
 do it, and special casing small powers is kind of cheating. In this case it 
 may be worth it, but what we'd really like is a more general kind of 
 optimization here that has the same effect in this particular case.



Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-08 Thread Jameson Nash
there already is a special inference pass for this:
https://github.com/JuliaLang/julia/blob/3b2269d835e420d16ebf6758292213bcd896df8f/base/inference.jl#L2633

but it doesn't look like it is applicable to arrays

perhaps it should be?

or perhaps, for small isint() values, we could try using
Base.power_by_squaring?
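For reference, power_by_squaring computes integer powers with O(log y) multiplies rather than a libm pow call:

```julia
# 10 = 0b1010, so 2^10 takes only a handful of squarings and multiplies.
@assert Base.power_by_squaring(2, 10) == 1024
@assert Base.power_by_squaring(3, 5) == 243
```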

regardless, running your benchmarks in global scope is unlikely to ever
give an accurate representation of the full range of optimizations Julia
can attempt
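A minimal sketch of that last point: wrap the work in a function so the compiler can specialize on the argument types, and time a call made after compilation (the function name here is illustrative, and the broadcast syntax is later than the thread's):

```julia
square_all(x) = x .^ 2   # inside a function, the compiler specializes on eltype

X = rand(1000, 1000)
square_all(X)            # first call triggers compilation
@time square_all(X)      # subsequent calls measure only the actual work
```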

 You can't tell someone that Oh sure, Julia does support that extremely
convenient syntax, but because it's 6 times slower (and it is), you need to
essentially write C.

No, you don't need to write C code. That would be the advice you get from
Matlab, Octave, Python, and most other dynamic languages. What Julia says
is that you can write for-loops without the usual massive performance
penalty.

A wise man once told me (ok, it was Jeff, and it was about 2 years ago)
that if he wanted to make a faster MATLAB, he would have spent the last few
years writing optimized kernels for Octave. But he instead wanted to make a
new language entirely.



On Mon, Sep 8, 2014 at 4:01 PM, Stefan Karpinski ste...@karpinski.org
wrote:

 On Mon, Sep 8, 2014 at 7:48 PM, Tony Kelman t...@kelman.net wrote:


 Expectations based on the way Octave/Matlab work are really not
 applicable to a system that works completely differently.


 That said, we do want to make this faster – but we don't want to cheat to
 do it, and special casing small powers is kind of cheating. In this case it
 may be worth it, but what we'd really like is a more general kind of
 optimization here that has the same effect in this particular case.



Re: [julia-users] Re: self dot product of all columns in a matrix: Julia vs. Octave

2014-09-08 Thread Jeff Waller
I feel that x^2 must be among the things that are convenient and fast; it's 
one of those language elements that people simply depend on. If it's 
inconvenient, then people are going to experience that over and over. If 
it's not fast enough, people are going to experience slowness over and over.

Like Simon says, it's already a known issue 
(https://github.com/JuliaLang/julia/issues/2741). Maybe using something 
like VML is an option or even necessary, or maybe extending inference.jl 
would help; or maybe it eventually won't be necessary at all 
(http://llvm.org/docs/Vectorizers.html).

Julia is a fundamental improvement, but give Matlab and Octave their due: 
that syntax is great. When I say "essentially C", to me it means

Option A, use the built-in infix operator:

X.^y

versus Option B, go write a function:

pow(x, y)
    for i = 1:...
        for j = ...
            ...
            etc.

Option A is just too good; it has to be supported.

What to do? Is this a fundamental flaw? No, I don't think so. Is this a 
one-time-only thing? It feels like no; this is one of many things that 
will occasionally occur. Is it possible to handle this without it becoming 
a hassle? Like I think Tony is saying, can/should there be a branch of 
immediate optimizations: stuff that would eventually be done in a better 
way but needs to be completed now? It's a branch, so things can be 
collected and retracted more coherently.