[julia-users] Re: I can't believe this speed-up!

2016-07-23 Thread Ferran Mazzanti
Hi Roger,

that makes a lot of sense to me... I'll also be careful with globals. Still, 
if the mechanism is the one you mention, there is something fuzzy here, as 
the timings I posted are right, human-wise: the reported times were the ones 
I actually had to wait in front of my computer to get the result. Should I 
understand, then, that top-level loops are highly unoptimized?

Best,

Ferran.

On Friday, July 22, 2016 at 2:52:23 PM UTC+2, Roger Whitney wrote:
>
> Instead of using tic/toc, use @time to time your loops. You will find that 
> in your sequential loop you are allocating a lot of memory, while the 
> @parallel loop does not. The difference in time is due to the memory 
> allocation. One of my students ran into this earlier this week, and that was 
> the cause in his case. My understanding is that the compiler does not 
> optimize for-loops run at the top level. When you put the sequential loop 
> in a function, the excessive memory allocation goes away, which makes the 
> sequential loop faster.
>
> You need to be careful using @parallel with no worker processes. With no 
> workers the @parallel loop can modify globals, and you will get the correct 
> result because it is all done in the same process. When you add workers, the 
> globals will be copied to each worker, the changes will be made on each 
> worker's copy, and the result is not copied back to the master process. So 
> code that works with no workers will break once you add workers.



[julia-users] Re: I can't believe this speed-up!

2016-07-22 Thread Roger Whitney
Instead of using tic/toc, use @time to time your loops. You will find that in 
your sequential loop you are allocating a lot of memory, while the @parallel 
loop does not. The difference in time is due to the memory allocation. One of 
my students ran into this earlier this week, and that was the cause in his 
case. My understanding is that the compiler does not optimize for-loops run 
at the top level. When you put the sequential loop in a function, the 
excessive memory allocation goes away, which makes the sequential loop faster.

You need to be careful using @parallel with no worker processes. With no workers 
the @parallel loop can modify globals, and you will get the correct result 
because it is all done in the same process. When you add workers, the globals 
will be copied to each worker, the changes will be made on each worker's copy, 
and the result is not copied back to the master process. So code that works 
with no workers will break once you add workers.
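
For concreteness, here is a minimal sketch of the advice above, written in the 
0.4/0.5-era syntax used elsewhere in this thread (the name power_loop is 
illustrative, not from the original posts):

A = [1.0 1.0001; 1.0002 1.0003]

# Top-level loop: z and A are globals, so the loop is type-unstable and
# allocates on every iteration.
z = A
@time for i in 1:10
    z *= A
end

# The same loop wrapped in a function: the compiler can specialize on the
# argument types and the excess allocation disappears.
function power_loop(A, n)
    z = A
    for i in 1:n
        z *= A
    end
    return z
end

power_loop(A, 10)       # run once first so compilation is not timed
@time power_loop(A, 10)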

[julia-users] Re: I can't believe this speed-up!

2016-07-21 Thread 'Greg Plowman' via julia-users
And also compare (note the @sync):

@time @sync @parallel for i in 1:10
    sleep(1)
end

Also note that using a reduction with @parallel will also wait:

z = @parallel (*) for i = 1:n
    A
end
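
A hedged sketch of how the reduction form maps onto the original example; it 
assumes workers have already been added (the addprocs(4) below is only an 
illustrative choice). The reduction both waits for the workers and combines 
their partial products:

addprocs(4)                      # illustrative; any number of workers

A = [1.0 1.0001; 1.0002 1.0003]
n = 10

# Each worker multiplies together its chunk of copies of A; @parallel (*)
# then multiplies the partial results and blocks until everything is done.
z = @parallel (*) for i = 1:n
    A
end

isapprox(z, A^n)                 # true up to floating-point rounding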


On Friday, July 22, 2016 at 3:11:15 AM UTC+10, Kristoffer Carlsson wrote:

>
>
> julia> @time for i in 1:10
>sleep(1)
>end
>  10.054067 seconds (60 allocations: 3.594 KB)
>
>
> julia> @time @parallel for i in 1:10
>sleep(1)
>end
>   0.195556 seconds (28.91 k allocations: 1.302 MB)
> 1-element Array{Future,1}:
>  Future(1,1,8,#NULL)
>
>
>
> On Thursday, July 21, 2016 at 6:00:47 PM UTC+2, Ferran Mazzanti wrote:
>>
>> Hi,
>>
>> mostly showing my astonishment, but I can't even understand the figures in 
>> this stupid parallelization code
>>
>> A = [[1.0 1.0001];[1.0002 1.0003]]
>> z = A
>> tic()
>> for i in 1:10
>> z *= A
>> end
>> toc()
>> A
>>
>> produces
>>
>> elapsed time: 105.458639263 seconds
>>
>> 2x2 Array{Float64,2}:
>>  1.0 1.0001
>>  1.0002  1.0003
>>
>>
>>
>> But then add @parallel in the for loop
>>
>> A = [[1.0 1.0001];[1.0002 1.0003]]
>> z = A
>> tic()
>> @parallel for i in 1:10
>> z *= A
>> end
>> toc()
>> A
>>
>> and get 
>>
>> elapsed time: 0.008912282 seconds
>>
>> 2x2 Array{Float64,2}:
>>  1.0 1.0001
>>  1.0002  1.0003
>>
>>
>> look at the elapsed time differences! And I'm running this on my Xeon 
>> desktop, not even a cluster
>> Of course A-B reports
>>
>> 2x2 Array{Float64,2}:
>>  0.0  0.0
>>  0.0  0.0
>>
>>
>> So is this what one should expect from this kind of simple 
>> parallelization? If so, I'm definitely *in love* with Julia :):):)
>>
>> Best,
>>
>> Ferran.
>>
>>
>>

[julia-users] Re: I can't believe this speed-up!

2016-07-21 Thread Kristoffer Carlsson


julia> @time for i in 1:10
   sleep(1)
   end
 10.054067 seconds (60 allocations: 3.594 KB)


julia> @time @parallel for i in 1:10
   sleep(1)
   end
  0.195556 seconds (28.91 k allocations: 1.302 MB)
1-element Array{Future,1}:
 Future(1,1,8,#NULL)
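
The second timing above is tiny because @parallel returns Futures immediately 
instead of waiting for the loop to finish. A minimal sketch of how to make the 
measurement comparable, following the @sync suggestion elsewhere in this 
thread:

# @sync blocks until all the spawned work has finished, so the reported
# time now covers the whole loop.
@time @sync @parallel for i in 1:10
    sleep(1)
end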



On Thursday, July 21, 2016 at 6:00:47 PM UTC+2, Ferran Mazzanti wrote:
>
> Hi,
>
> mostly showing my astonishment, but I can't even understand the figures in 
> this stupid parallelization code
>
> A = [[1.0 1.0001];[1.0002 1.0003]]
> z = A
> tic()
> for i in 1:10
> z *= A
> end
> toc()
> A
>
> produces
>
> elapsed time: 105.458639263 seconds
>
> 2x2 Array{Float64,2}:
>  1.0 1.0001
>  1.0002  1.0003
>
>
>
> But then add @parallel in the for loop
>
> A = [[1.0 1.0001];[1.0002 1.0003]]
> z = A
> tic()
> @parallel for i in 1:10
> z *= A
> end
> toc()
> A
>
> and get 
>
> elapsed time: 0.008912282 seconds
>
> 2x2 Array{Float64,2}:
>  1.0 1.0001
>  1.0002  1.0003
>
>
> look at the elapsed time differences! And I'm running this on my Xeon 
> desktop, not even a cluster
> Of course A-B reports
>
> 2x2 Array{Float64,2}:
>  0.0  0.0
>  0.0  0.0
>
>
> So is this what one should expect from this kind of simple 
> parallelization? If so, I'm definitely *in love* with Julia :):):)
>
> Best,
>
> Ferran.
>
>
>

[julia-users] Re: I can't believe this speed-up!

2016-07-21 Thread Nathan Smith
In a Jupyter notebook, add worker processes with addprocs(N).
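
A minimal sketch of that workflow (the worker count 4 is only an example):

addprocs(4)     # start 4 extra worker processes from within the notebook
nprocs()        # total number of processes; should now report 5
nworkers()      # workers available to @parallel; should now report 4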

On Thursday, 21 July 2016 12:59:02 UTC-4, Nathan Smith wrote:
>
> To be clear, you need to compare the final 'z', not the final 'A', to check 
> whether your calculations are consistent. The matrix A does not change 
> throughout this calculation, but the matrix z does.
> Also, there is no parallelism with the @parallel loop unless you start Julia 
> with 'julia -p N', where N is the number of processes you'd like to use.
>
> On Thursday, 21 July 2016 12:45:17 UTC-4, Ferran Mazzanti wrote:
>>
>> Hi Nathan,
>>
>> I posted the codes, so you can check if they do the same thing or not. 
>> These went to separate cells in Jupyter, nothing more and nothing less.
>> Not even a single line I didn't post. And yes, I understand your line of 
>> reasoning, which is why I was astonished too.
>> But I can't see what is making this huge difference, and I'd like to know :)
>>
>> Best,
>>
>> Ferran.
>>
>> On Thursday, July 21, 2016 at 6:31:57 PM UTC+2, Nathan Smith wrote:
>>>
>>> Hey Ferran, 
>>>
>>> You should be suspicious when your apparent speed-up surpasses the level 
>>> of parallelism available on your CPU. It looks like your codes don't 
>>> actually compute the same thing.
>>>
>>> I'm assuming you're trying to compute the matrix power of A (A^10) by 
>>> repeatedly multiplying A. In your parallel code, each process gets a local 
>>> copy of 'z' and uses that. This means each process is computing something 
>>> like (A^(10/# of procs)). Check out the section of the documentation on 
>>> parallel map and loops to see what I mean.
>>>
>>> That said, that doesn't explain your speed-up completely; you should 
>>> also make sure that each part of your script is wrapped in a function and 
>>> that you 'warm up' each function by running it once before comparing.
>>>
>>> Cheers, 
>>> Nathan
>>>
>>> On Thursday, 21 July 2016 12:00:47 UTC-4, Ferran Mazzanti wrote:

 Hi,

 mostly showing my astonishment, but I can't even understand the figures 
 in this stupid parallelization code

 A = [[1.0 1.0001];[1.0002 1.0003]]
 z = A
 tic()
 for i in 1:10
 z *= A
 end
 toc()
 A

 produces

 elapsed time: 105.458639263 seconds

 2x2 Array{Float64,2}:
  1.0 1.0001
  1.0002  1.0003



 But then add @parallel in the for loop

 A = [[1.0 1.0001];[1.0002 1.0003]]
 z = A
 tic()
 @parallel for i in 1:10
 z *= A
 end
 toc()
 A

 and get 

 elapsed time: 0.008912282 seconds

 2x2 Array{Float64,2}:
  1.0 1.0001
  1.0002  1.0003


 look at the elapsed time differences! And I'm running this on my Xeon 
 desktop, not even a cluster
 Of course A-B reports

 2x2 Array{Float64,2}:
  0.0  0.0
  0.0  0.0


 So is this what one should expect from this kind of simple 
 parallelization? If so, I'm definitely *in love* with Julia :):):)

 Best,

 Ferran.




[julia-users] Re: I can't believe this speed-up!

2016-07-21 Thread Nathan Smith
To be clear, you need to compare the final 'z', not the final 'A', to check 
whether your calculations are consistent. The matrix A does not change 
throughout this calculation, but the matrix z does.
Also, there is no parallelism with the @parallel loop unless you start Julia 
with 'julia -p N', where N is the number of processes you'd like to use.

On Thursday, 21 July 2016 12:45:17 UTC-4, Ferran Mazzanti wrote:
>
> Hi Nathan,
>
> I posted the codes, so you can check if they do the same thing or not. 
> These went to separate cells in Jupyter, nothing more and nothing less.
> Not even a single line I didn't post. And yes, I understand your line of 
> reasoning, which is why I was astonished too.
> But I can't see what is making this huge difference, and I'd like to know :)
>
> Best,
>
> Ferran.
>
> On Thursday, July 21, 2016 at 6:31:57 PM UTC+2, Nathan Smith wrote:
>>
>> Hey Ferran, 
>>
>> You should be suspicious when your apparent speed-up surpasses the level 
>> of parallelism available on your CPU. It looks like your codes don't 
>> actually compute the same thing.
>>
>> I'm assuming you're trying to compute the matrix power of A (A^10) by 
>> repeatedly multiplying A. In your parallel code, each process gets a local 
>> copy of 'z' and uses that. This means each process is computing something 
>> like (A^(10/# of procs)). Check out the section of the documentation on 
>> parallel map and loops to see what I mean.
>>
>> That said, that doesn't explain your speed-up completely; you should also 
>> make sure that each part of your script is wrapped in a function and that 
>> you 'warm up' each function by running it once before comparing.
>>
>> Cheers, 
>> Nathan
>>
>> On Thursday, 21 July 2016 12:00:47 UTC-4, Ferran Mazzanti wrote:
>>>
>>> Hi,
>>>
>>> mostly showing my astonishment, but I can't even understand the figures in 
>>> this stupid parallelization code
>>>
>>> A = [[1.0 1.0001];[1.0002 1.0003]]
>>> z = A
>>> tic()
>>> for i in 1:10
>>> z *= A
>>> end
>>> toc()
>>> A
>>>
>>> produces
>>>
>>> elapsed time: 105.458639263 seconds
>>>
>>> 2x2 Array{Float64,2}:
>>>  1.0 1.0001
>>>  1.0002  1.0003
>>>
>>>
>>>
>>> But then add @parallel in the for loop
>>>
>>> A = [[1.0 1.0001];[1.0002 1.0003]]
>>> z = A
>>> tic()
>>> @parallel for i in 1:10
>>> z *= A
>>> end
>>> toc()
>>> A
>>>
>>> and get 
>>>
>>> elapsed time: 0.008912282 seconds
>>>
>>> 2x2 Array{Float64,2}:
>>>  1.0 1.0001
>>>  1.0002  1.0003
>>>
>>>
>>> look at the elapsed time differences! And I'm running this on my Xeon 
>>> desktop, not even a cluster
>>> Of course A-B reports
>>>
>>> 2x2 Array{Float64,2}:
>>>  0.0  0.0
>>>  0.0  0.0
>>>
>>>
>>> So is this what one should expect from this kind of simple 
>>> parallelization? If so, I'm definitely *in love* with Julia :):):)
>>>
>>> Best,
>>>
>>> Ferran.
>>>
>>>
>>>

[julia-users] Re: I can't believe this speed-up!

2016-07-21 Thread Ferran Mazzanti
Nathan,

the execution of these two functions gives essentially the same timings, no 
matter how many processes I have added with addprocs().
Very surprising to me...
Of course I prefer the sped-up version :)

Best,

Ferran.

On Thursday, July 21, 2016 at 6:40:14 PM UTC+2, Nathan Smith wrote:
>
> Try comparing these two functions:
>
> function serial_example()
>     A = [[1.0 1.001]; [1.002 1.003]]
>     z = A
>     for i in 1:10
>         z *= A
>     end
>     return z
> end
>
> function parallel_example()
>     A = [[1.0 1.001]; [1.002 1.003]]
>     z = @parallel (*) for i in 1:10
>         A
>     end
>     return z
> end
>
>

[julia-users] Re: I can't believe this speed-up!

2016-07-21 Thread Ferran Mazzanti
Hi Nathan,

I posted the codes, so you can check if they do the same thing or not. 
These went to separate cells in Jupyter, nothing more and nothing less.
Not even a single line I didn't post. And yes, I understand your line of 
reasoning, which is why I was astonished too.
But I can't see what is making this huge difference, and I'd like to know :)

Best,

Ferran.

On Thursday, July 21, 2016 at 6:31:57 PM UTC+2, Nathan Smith wrote:
>
> Hey Ferran, 
>
> You should be suspicious when your apparent speed-up surpasses the level 
> of parallelism available on your CPU. It looks like your codes don't 
> actually compute the same thing.
>
> I'm assuming you're trying to compute the matrix power of A (A^10) by 
> repeatedly multiplying A. In your parallel code, each process gets a local 
> copy of 'z' and uses that. This means each process is computing something 
> like (A^(10/# of procs)). Check out the section of the documentation on 
> parallel map and loops to see what I mean.
>
> That said, that doesn't explain your speed-up completely; you should also 
> make sure that each part of your script is wrapped in a function and that 
> you 'warm up' each function by running it once before comparing.
>
> Cheers, 
> Nathan
>
> On Thursday, 21 July 2016 12:00:47 UTC-4, Ferran Mazzanti wrote:
>>
>> Hi,
>>
>> mostly showing my astonishment, but I can't even understand the figures in 
>> this stupid parallelization code
>>
>> A = [[1.0 1.0001];[1.0002 1.0003]]
>> z = A
>> tic()
>> for i in 1:10
>> z *= A
>> end
>> toc()
>> A
>>
>> produces
>>
>> elapsed time: 105.458639263 seconds
>>
>> 2x2 Array{Float64,2}:
>>  1.0 1.0001
>>  1.0002  1.0003
>>
>>
>>
>> But then add @parallel in the for loop
>>
>> A = [[1.0 1.0001];[1.0002 1.0003]]
>> z = A
>> tic()
>> @parallel for i in 1:10
>> z *= A
>> end
>> toc()
>> A
>>
>> and get 
>>
>> elapsed time: 0.008912282 seconds
>>
>> 2x2 Array{Float64,2}:
>>  1.0 1.0001
>>  1.0002  1.0003
>>
>>
>> look at the elapsed time differences! And I'm running this on my Xeon 
>> desktop, not even a cluster
>> Of course A-B reports
>>
>> 2x2 Array{Float64,2}:
>>  0.0  0.0
>>  0.0  0.0
>>
>>
>> So is this what one should expect from this kind of simple 
>> parallelization? If so, I'm definitely *in love* with Julia :):):)
>>
>> Best,
>>
>> Ferran.
>>
>>
>>

[julia-users] Re: I can't believe this speed-up!

2016-07-21 Thread Ferran Mazzanti
I posted this because I also find the results... astonishing. However, the 
timings are apparently real: the first one took more than 1.5 minutes on my 
wrist watch, and the second calculation finished instantly.
And no, no function wrapping whatsoever...

On Thursday, July 21, 2016 at 6:22:50 PM UTC+2, Chris Rackauckas wrote:
>
> I wouldn't expect that much of a change unless you have a whole lot of 
> cores (even then, wouldn't expect this much of a change).
>
> Is this wrapped in a function when you're timing it?
>
> On Thursday, July 21, 2016 at 9:00:47 AM UTC-7, Ferran Mazzanti wrote:
>>
>> Hi,
>>
>> mostly showing my astonishment, but I can't even understand the figures in 
>> this stupid parallelization code
>>
>> A = [[1.0 1.0001];[1.0002 1.0003]]
>> z = A
>> tic()
>> for i in 1:10
>> z *= A
>> end
>> toc()
>> A
>>
>> produces
>>
>> elapsed time: 105.458639263 seconds
>>
>> 2x2 Array{Float64,2}:
>>  1.0 1.0001
>>  1.0002  1.0003
>>
>>
>>
>> But then add @parallel in the for loop
>>
>> A = [[1.0 1.0001];[1.0002 1.0003]]
>> z = A
>> tic()
>> @parallel for i in 1:10
>> z *= A
>> end
>> toc()
>> A
>>
>> and get 
>>
>> elapsed time: 0.008912282 seconds
>>
>> 2x2 Array{Float64,2}:
>>  1.0 1.0001
>>  1.0002  1.0003
>>
>>
>> look at the elapsed time differences! And I'm running this on my Xeon 
>> desktop, not even a cluster
>> Of course A-B reports
>>
>> 2x2 Array{Float64,2}:
>>  0.0  0.0
>>  0.0  0.0
>>
>>
>> So is this what one should expect from this kind of simple 
>> parallelization? If so, I'm definitely *in love* with Julia :):):)
>>
>> Best,
>>
>> Ferran.
>>
>>
>>

[julia-users] Re: I can't believe this speed-up!

2016-07-21 Thread Nathan Smith
Try comparing these two functions:

function serial_example()
    A = [[1.0 1.001]; [1.002 1.003]]
    z = A
    for i in 1:10
        z *= A
    end
    return z
end

function parallel_example()
    A = [[1.0 1.001]; [1.002 1.003]]
    z = @parallel (*) for i in 1:10
        A
    end
    return z
end
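
One possible way to time them, following the warm-up advice given earlier in 
the thread (the addprocs(4) call is only an illustration; any number of 
workers will do):

addprocs(4)                            # add workers before timing

serial_example(); parallel_example()   # run each once so compilation is not timed

@time serial_example()
@time parallel_example()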



[julia-users] Re: I can't believe this speed-up!

2016-07-21 Thread Nathan Smith
Hey Ferran, 

You should be suspicious when your apparent speed-up surpasses the level of 
parallelism available on your CPU. It looks like your codes don't actually 
compute the same thing.

I'm assuming you're trying to compute the matrix power of A (A^10) by 
repeatedly multiplying A. In your parallel code, each process gets a local 
copy of 'z' and uses that. This means each process is computing something 
like (A^(10/# of procs)). Check out the section of the documentation on 
parallel map and loops to see what I mean.

That said, that doesn't explain your speed-up completely; you should also 
make sure that each part of your script is wrapped in a function and that 
you 'warm up' each function by running it once before comparing.

Cheers, 
Nathan
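
A small demonstration of the 'local copy' point above, in the spirit of the 
original example (a hedged sketch; it assumes at least one worker has been 
added, and uses the matrix from the thread):

addprocs(2)                      # assumed: at least one worker process

A = [1.0 1.0001; 1.0002 1.0003]
z = A
@sync @parallel for i in 1:10
    z *= A                       # each worker updates its own copy of z
end

z == A                           # still true on the master process: the
                                 # workers' updates were never copied back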

On Thursday, 21 July 2016 12:00:47 UTC-4, Ferran Mazzanti wrote:
>
> Hi,
>
> mostly showing my astonishment, but I can't even understand the figures in 
> this stupid parallelization code
>
> A = [[1.0 1.0001];[1.0002 1.0003]]
> z = A
> tic()
> for i in 1:10
> z *= A
> end
> toc()
> A
>
> produces
>
> elapsed time: 105.458639263 seconds
>
> 2x2 Array{Float64,2}:
>  1.0 1.0001
>  1.0002  1.0003
>
>
>
> But then add @parallel in the for loop
>
> A = [[1.0 1.0001];[1.0002 1.0003]]
> z = A
> tic()
> @parallel for i in 1:10
> z *= A
> end
> toc()
> A
>
> and get 
>
> elapsed time: 0.008912282 seconds
>
> 2x2 Array{Float64,2}:
>  1.0 1.0001
>  1.0002  1.0003
>
>
> look at the elapsed time differences! And I'm running this on my Xeon 
> desktop, not even a cluster
> Of course A-B reports
>
> 2x2 Array{Float64,2}:
>  0.0  0.0
>  0.0  0.0
>
>
> So is this what one should expect from this kind of simple 
> parallelization? If so, I'm definitely *in love* with Julia :):):)
>
> Best,
>
> Ferran.
>
>
>

[julia-users] Re: I can't believe this speed-up!

2016-07-21 Thread Chris Rackauckas
I wouldn't expect that much of a change unless you have a whole lot of 
cores (even then, wouldn't expect this much of a change).

Is this wrapped in a function when you're timing it?

On Thursday, July 21, 2016 at 9:00:47 AM UTC-7, Ferran Mazzanti wrote:
>
> Hi,
>
> mostly showing my astonishment, but I can't even understand the figures in 
> this stupid parallelization code
>
> A = [[1.0 1.0001];[1.0002 1.0003]]
> z = A
> tic()
> for i in 1:10
> z *= A
> end
> toc()
> A
>
> produces
>
> elapsed time: 105.458639263 seconds
>
> 2x2 Array{Float64,2}:
>  1.0 1.0001
>  1.0002  1.0003
>
>
>
> But then add @parallel in the for loop
>
> A = [[1.0 1.0001];[1.0002 1.0003]]
> z = A
> tic()
> @parallel for i in 1:10
> z *= A
> end
> toc()
> A
>
> and get 
>
> elapsed time: 0.008912282 seconds
>
> 2x2 Array{Float64,2}:
>  1.0 1.0001
>  1.0002  1.0003
>
>
> look at the elapsed time differences! And I'm running this on my Xeon 
> desktop, not even a cluster
> Of course A-B reports
>
> 2x2 Array{Float64,2}:
>  0.0  0.0
>  0.0  0.0
>
>
> So is this what one should expect from this kind of simple 
> parallelization? If so, I'm definitely *in love* with Julia :):):)
>
> Best,
>
> Ferran.
>
>
>