Re: [julia-users] Embarrassingly parallel workload

2016-01-30 Thread Christopher Alexander
I have tried the construction below with no success.  In v0.4.3, I end up 
getting a segmentation fault.  In the latest v0.5.0, the run time is 3-4x 
that of the non-parallelized version, and the constructed array differs 
vastly from the one produced by the non-parallelized code.  Below is the 
C++ code that I am essentially trying to emulate:

void TreeLattice::stepback(Size i, const Array& values,
                           Array& newValues) const {
    #pragma omp parallel for
    for (Size j=0; j<this->impl().size(i); j++) {
        Real value = 0.0;
        for (Size l=0; l<n_; l++) {
            value += this->impl().probability(i,j,l) *
                     values[this->impl().descendant(i,j,l)];
        }
        value *= this->impl().discount(i,j);
        newValues[j] = value;
    }
}

The calls to probability, descendant, and discount all end up accessing 
data in other objects, so I tried to prepend those function and type 
definitions with @everywhere.  However, that started me down a long chain 
of eventually wrapping every file in my module in @everywhere, and there 
were still errors complaining about things not being defined.  At this 
point I am really confused about how to construct what would appear to be a 
rather simple parallelized for loop that generates the same results as the 
non-parallelized code.  I've pored over both this forum and other 
resources, and nothing has really worked.

Any help would be appreciated.

Thanks!

Chris
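
For reference, a minimal sketch of one way that loop can be written with a
SharedArray, assuming workers have already been added and using hypothetical
stand-ins for probability, descendant, and discount (the real ones would come
from the module's own types, so the names here are placeholders only):

@everywhere begin
    # Hypothetical stand-ins for the lattice accessors; placeholders only.
    probability(i, j, l) = 1/3
    descendant(i, j, l)  = j + l - 1
    discount(i, j)       = 0.99
end

function stepback(i, values::Vector{Float64}, nbranches::Int)
    m = length(values)
    newvalues = SharedArray(Float64, m)
    @sync @parallel for j = 1:m
        value = 0.0
        for l = 1:nbranches
            # clamp the index so the placeholder accessors stay in bounds
            value += probability(i, j, l) * values[min(descendant(i, j, l), m)]
        end
        newvalues[j] = value * discount(i, j)
    end
    newvalues
end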


On Thursday, August 20, 2015 at 4:52:52 AM UTC-4, Nils Gudat wrote:
>
> Sebastian, I'm not sure I understand you correctly, but point (1) in your 
> list can usually be taken care of by wrapping all the necessary 
> usings/requires/includes and definitions in an @everywhere begin ... end 
> block.
>
> Julio, as for your original problem, I think Tim's advice about 
> SharedArrays was perfectly reasonable. Without having looked at your 
> problem in detail, I think you should be able to do something like this 
> (and I also think this gets close enough to what Sebastian was talking 
> about, and to Matlab's parfor, unless I'm completely misunderstanding your 
> problem):
>
> nprocs()==CPU_CORES || addprocs(CPU_CORES-1)
> results = SharedArray(Float64, (m,n))
>
> @sync @parallel for i = 1:n
> results[:, i] = complicatedfunction(inputs[i])
> end
>


Re: [julia-users] Embarrassingly parallel workload

2016-01-30 Thread Christopher Alexander
By "construction below", I mean this:

results = SharedArray(Float64, (m,n))

@sync @parallel for i = 1:n
results[:, i] = complicatedfunction(inputs[i])
end
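
One thing worth checking before running that loop: everything
complicatedfunction touches must be defined on every worker first. A minimal
sketch, where defs.jl and MyModule are hypothetical names standing in for
wherever the function actually lives:

addprocs(CPU_CORES - 1)

@everywhere include("defs.jl")   # hypothetical file defining complicatedfunction
# or, if the code lives in a package:
@everywhere using MyModule       # hypothetical module name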

On Saturday, January 30, 2016 at 2:31:40 PM UTC-5, Christopher Alexander 
wrote:
>
> I have tried the construction below with no success.  In v0.4.3, I end up 
> getting a segmentation fault.  In the latest v0.5.0, the run time is 3-4x 
> that of the non-parallelized version, and the constructed array differs 
> vastly from the one produced by the non-parallelized code.  Below is the 
> C++ code that I am essentially trying to emulate:
>
> void TreeLattice::stepback(Size i, const Array& values,
>                            Array& newValues) const {
>     #pragma omp parallel for
>     for (Size j=0; j<this->impl().size(i); j++) {
>         Real value = 0.0;
>         for (Size l=0; l<n_; l++) {
>             value += this->impl().probability(i,j,l) *
>                      values[this->impl().descendant(i,j,l)];
>         }
>         value *= this->impl().discount(i,j);
>         newValues[j] = value;
>     }
> }
>
> The calls to probability, descendant, and discount all end up accessing 
> data in other objects, so I tried to prepend those function and type 
> definitions with @everywhere.  However, that started me down a long chain 
> of eventually wrapping every file in my module in @everywhere, and there 
> were still errors complaining about things not being defined.  At this 
> point I am really confused about how to construct what would appear to be 
> a rather simple parallelized for loop that generates the same results as 
> the non-parallelized code.  I've pored over both this forum and other 
> resources, and nothing has really worked.
>
> Any help would be appreciated.
>
> Thanks!
>
> Chris
>
>
> On Thursday, August 20, 2015 at 4:52:52 AM UTC-4, Nils Gudat wrote:
>>
>> Sebastian, I'm not sure I understand you correctly, but point (1) in your 
>> list can usually be taken care of by wrapping all the necessary 
>> usings/requires/includes and definitions in an @everywhere begin ... end 
>> block.
>>
>> Julio, as for your original problem, I think Tim's advice about 
>> SharedArrays was perfectly reasonable. Without having looked at your 
>> problem in detail, I think you should be able to do something like this 
>> (and I also think this gets close enough to what Sebastian was talking 
>> about, and to Matlab's parfor, unless I'm completely misunderstanding your 
>> problem):
>>
>> nprocs()==CPU_CORES || addprocs(CPU_CORES-1)
>> results = SharedArray(Float64, (m,n))
>>
>> @sync @parallel for i = 1:n
>> results[:, i] = complicatedfunction(inputs[i])
>> end
>>
>

Re: [julia-users] Embarrassingly parallel workload

2015-08-20 Thread Nils Gudat
Sebastian, I'm not sure I understand you correctly, but point (1) in your 
list can usually be taken care of by wrapping all the necessary 
usings/requires/includes and definitions in an @everywhere begin ... end 
block.

Julio, as for your original problem, I think Tim's advice about 
SharedArrays was perfectly reasonable. Without having looked at your 
problem in detail, I think you should be able to do something like this 
(and I also think this gets close enough to what Sebastian was talking 
about, and to Matlab's parfor, unless I'm completely misunderstanding your 
problem):

nprocs()==CPU_CORES || addprocs(CPU_CORES-1)
results = SharedArray(Float64, (m,n))

@sync @parallel for i = 1:n
results[:, i] = complicatedfunction(inputs[i])
end


Re: [julia-users] Embarrassingly parallel workload

2015-08-20 Thread Ryan Cox

Sebastian,

This talk from JuliaCon 2015 discusses progress on OpenMP-like threading:
Kiran Pamnany and Ranjan Anantharaman: Multi-threading Julia: 
http://youtu.be/GvLhseZ4D8M?a


Ryan
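
For context, the threading work described in that talk later shipped as the
experimental Base.Threads module in Julia 0.5. A rough sketch of that API,
assuming a 0.5-era build with the JULIA_NUM_THREADS environment variable set
(compute_threaded and the sin kernel are illustrative placeholders):

# Experimental in Julia 0.5; the thread count comes from JULIA_NUM_THREADS.
function compute_threaded(X::Vector{Float64})
    A = zeros(Float64, length(X))
    Threads.@threads for i = 1:length(X)
        A[i] = sin(X[i])^2   # placeholder for the expensive per-item kernel
    end
    A
end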

On 08/19/2015 02:42 PM, Sebastian Nowozin wrote:


Hi Julio,

I believe this is a very common type of workload, especially in 
scientific computing.
In C++ one can use OpenMP for this type of computation; in Matlab 
there is parfor.  From the user's perspective, both just work.


In Julia, I have not found an easy and convenient way to do such 
computation.  These are the difficulties I have experienced while 
trying to do this with distributed arrays and Julia's parallel operations:


1. Having to prepend @parallel before every import/require so that all 
parallel processes have all definitions.
2. Working with the distributed arrays API has given me plenty of 
headaches; it is more like programming with local/global contexts in 
OpenCL.
3. (I believe this is fixed now.) There were garbage collection issues 
and crashes on Windows when using distributed arrays.


What would be very convenient is a type of OpenMP like parallelism, 
really anything that can enable us to write simply


function compute(X::Vector{Float64}, theta)
    n = length(X)
    A = zeros(Float64, n)
    @smp for i = 1:n
        A[i] = embarrassingly_parallel(X[i], theta)
    end
    A
end

Here @smp would correspond to #pragma omp parallel for.
I know this may be difficult to implement for a language as dynamic as 
Julia, but it is hard to argue against this simplicity from the users' 
point of view.


As Clang/LLVM now support OpenMP (https://clang-omp.github.io/), perhaps 
one can recycle the same OpenMP runtime for such lightweight 
parallelism?


Thanks,
Sebastian


On Wednesday, 19 August 2015 19:03:59 UTC+1, Júlio Hoffimann wrote:

Hi Kristoffer, sorry for the delay and thanks for the code.

What I want to do is very simple: I have an expensive loop for
i=1:N such that each iteration is independent and produces a large
array of size M. The result of this loop is a matrix of size MxN.
I have many CPU cores at my disposal and want to distribute this
work.

In the past I accomplished that with MPI in Python:
https://github.com/juliohm/HUM/blob/master/pyhum/utils.py
Whenever a process in the pool is free it consumes an iteration
of the loop. What exactly is the @parallel macro in Julia doing?
How can I modify the code I previously posted in this thread to
achieve such an effect?

-Júlio





Re: [julia-users] Embarrassingly parallel workload

2015-08-20 Thread Sebastian Nowozin
Thanks Ryan for the pointer; this is awesome work, and I am looking forward
to this becoming part of the Julia release in Q3.

Sebastian


On Thu, Aug 20, 2015 at 3:34 PM, Ryan Cox ryan_...@byu.edu wrote:

 Sebastian,

 This talk from JuliaCon 2015 discusses progress on OpenMP-like threading:
 Kiran Pamnany and Ranjan Anantharaman: Multi-threading Julia:
 http://youtu.be/GvLhseZ4D8M?a

 Ryan


 On 08/19/2015 02:42 PM, Sebastian Nowozin wrote:


 Hi Julio,

 I believe this is a very common type of workload, especially in scientific
 computing.
 In C++ one can use OpenMP for this type of computation; in Matlab there is
 parfor.  From the user's perspective, both just work.

 In Julia, I have not found an easy and convenient way to do such
 computation.  These are the difficulties I have experienced while trying to
 do this with distributed arrays and Julia's parallel operations:

 1. Having to prepend @parallel before every import/require so that all
 parallel processes have all definitions.
 2. Working with the distributed arrays API has given me plenty of
 headaches; it is more like programming with local/global contexts in OpenCL.
 3. (I believe this is fixed now.) There were garbage collection issues and
 crashes on Windows when using distributed arrays.

 What would be very convenient is a type of OpenMP like parallelism, really
 anything that can enable us to write simply

 function compute(X::Vector{Float64}, theta)
     n = length(X)
     A = zeros(Float64, n)
     @smp for i = 1:n
         A[i] = embarrassingly_parallel(X[i], theta)
     end
     A
 end

 Here @smp would correspond to #pragma omp parallel for.
 I know this may be difficult to implement for a language as dynamic as
 Julia, but it is hard to argue against this simplicity from the users'
 point of view.

 As Clang/LLVM now support OpenMP (https://clang-omp.github.io/), perhaps
 one can recycle the same OpenMP runtime for such lightweight
 parallelism?

 Thanks,
 Sebastian


 On Wednesday, 19 August 2015 19:03:59 UTC+1, Júlio Hoffimann wrote:

 Hi Kristoffer, sorry for the delay and thanks for the code.

 What I want to do is very simple: I have an expensive loop for i=1:N such
 that each iteration is independent and produces a large array of size M.
 The result of this loop is a matrix of size MxN. I have many CPU cores at
 my disposal and want to distribute this work.

 In the past I accomplished that with MPI in Python:
 https://github.com/juliohm/HUM/blob/master/pyhum/utils.py
 Whenever a process in the pool is free it consumes an iteration of the
 loop. What exactly is the @parallel macro in Julia doing? How can I modify
 the code I previously posted in this thread to achieve such an effect?

 -Júlio





Re: [julia-users] Embarrassingly parallel workload

2015-08-19 Thread Júlio Hoffimann
Hi Kristoffer, sorry for the delay and thanks for the code.

What I want to do is very simple: I have an expensive loop for i=1:N such
that each iteration is independent and produces a large array of size M.
The result of this loop is a matrix of size MxN. I have many CPU cores at
my disposal and want to distribute this work.

In the past I accomplished that with MPI in Python:
https://github.com/juliohm/HUM/blob/master/pyhum/utils.py Whenever a
process in the pool is free it consumes an iteration of the loop. What
exactly is the @parallel macro in Julia doing? How can I modify the code I
previously posted in this thread to achieve such an effect?

-Júlio
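
For readers with the same question: @parallel splits the range statically
across the available workers, while pmap hands out one iteration at a time to
whichever worker is free, which is exactly the pool behavior described above.
A minimal sketch, with complicatedfunction as a hypothetical stand-in for the
expensive kernel (M = 9, N = 60):

nprocs() == CPU_CORES || addprocs(CPU_CORES - 1)

@everywhere complicatedfunction(x) = rand(9)   # hypothetical stand-in

inputs  = rand(60)
columns = pmap(complicatedfunction, inputs)    # one task per free worker
results = hcat(columns...)                     # 9x60 matrix, one column per iteration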


Re: [julia-users] Embarrassingly parallel workload

2015-08-19 Thread Júlio Hoffimann
Hi Ismael,

MPI is distributed memory; I'm trying to use all the cores in my single
workstation with shared memory instead. Thanks for the link anyway.

-Júlio


Re: [julia-users] Embarrassingly parallel workload

2015-08-19 Thread Ismael VC
There is an MPI wrapper for Julia; I don't know if it'll suit your needs 
though:

https://github.com/JuliaParallel/MPI.jl

On Wednesday, August 19, 2015 at 13:03:59 (UTC-5), Júlio Hoffimann 
wrote:

 Hi Kristoffer, sorry for the delay and thanks for the code.

 What I want to do is very simple: I have an expensive loop for i=1:N such 
 that each iteration is independent and produces a large array of size M. 
 The result of this loop is a matrix of size MxN. I have many CPU cores at 
 my disposal and want to distribute this work.

 In the past I accomplished that with MPI in Python: 
 https://github.com/juliohm/HUM/blob/master/pyhum/utils.py Whenever a 
 process in the pool is free it consumes an iteration of the loop. What 
 exactly is the @parallel macro in Julia doing? How can I modify the code I 
 previously posted in this thread to achieve such an effect?

 -Júlio



Re: [julia-users] Embarrassingly parallel workload

2015-08-19 Thread Sebastian Nowozin

Hi Julio,

I believe this is a very common type of workload, especially in scientific 
computing.
In C++ one can use OpenMP for this type of computation; in Matlab there is 
parfor.  From the user's perspective, both just work.

In Julia, I have not found an easy and convenient way to do such 
computation.  These are the difficulties I have experienced while trying to 
do this with distributed arrays and Julia's parallel operations:

1. Having to prepend @parallel before every import/require so that all 
parallel processes have all definitions.
2. Working with the distributed arrays API has given me plenty of 
headaches; it is more like programming with local/global contexts in OpenCL.
3. (I believe this is fixed now.) There were garbage collection issues and 
crashes on Windows when using distributed arrays.

What would be very convenient is a type of OpenMP like parallelism, really 
anything that can enable us to write simply

function compute(X::Vector{Float64}, theta)
    n = length(X)
    A = zeros(Float64, n)
    @smp for i = 1:n
        A[i] = embarrassingly_parallel(X[i], theta)
    end
    A
end

Here @smp would correspond to #pragma omp parallel for.
I know this may be difficult to implement for a language as dynamic as 
Julia, but it is hard to argue against this simplicity from the users' 
point of view.

As Clang/LLVM now support OpenMP (https://clang-omp.github.io/), perhaps 
one can recycle the same OpenMP runtime for such lightweight 
parallelism?

Thanks,
Sebastian


On Wednesday, 19 August 2015 19:03:59 UTC+1, Júlio Hoffimann wrote:

 Hi Kristoffer, sorry for the delay and thanks for the code.

 What I want to do is very simple: I have an expensive loop for i=1:N such 
 that each iteration is independent and produces a large array of size M. 
 The result of this loop is a matrix of size MxN. I have many CPU cores at 
 my disposal and want to distribute this work.

 In the past I accomplished that with MPI in Python: 
 https://github.com/juliohm/HUM/blob/master/pyhum/utils.py Whenever a 
 process in the pool is free it consumes an iteration of the loop. What 
 exactly is the @parallel macro in Julia doing? How can I modify the code I 
 previously posted in this thread to achieve such an effect?

 -Júlio



Re: [julia-users] Embarrassingly parallel workload

2015-08-19 Thread Júlio Hoffimann
Hi Sebastian, thanks for sharing your experience in parallelizing Julia
code. I used OpenMP in the past too; it was very convenient in my C++
codebase. I remember an initiative, OpenACC, that was trying to bring
OpenMP and GPU accelerators together; I don't know its current status.
It may be of interest to the Julia devs.

-Júlio


Re: [julia-users] Embarrassingly parallel workload

2015-08-10 Thread Júlio Hoffimann
What am I doing wrong in the following code?

function foo(N; parallel=false)
  if parallel && nprocs() < CPU_CORES
    addprocs(CPU_CORES - nprocs())
  end

  result = SharedArray(Float64, 9, N)
  @parallel for i=1:N
    sleep(1)
    result[:,i] = rand(3,3)[:]
  end

  result
end

If I call foo(60, parallel=true), result is all zeros. Expected behavior is
a random matrix.

-Júlio
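
A likely culprit for the all-zeros result: without a reduction clause,
@parallel returns immediately, so result is read back before the workers have
finished writing. Blocking on completion with @sync, as other messages in this
thread do, should give the expected output; a minimal sketch under that
assumption:

result = SharedArray(Float64, 9, N)
@sync @parallel for i = 1:N
    result[:, i] = rand(3, 3)[:]
end
result   # filled only after every worker has finished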


Re: [julia-users] Embarrassingly parallel workload

2015-08-10 Thread Kristoffer Carlsson
Something like this?

@everywhere function fill(A::SharedArray)
    # each worker writes only the block of A assigned to it
    for idx in Base.localindexes(A)
        A[idx] = rand()
    end
end

function fill_array(m, n)
    A = SharedArray(Float64, (m, n))
    @sync begin
        for p in procs(A)
            @async remotecall_wait(p, fill, A)
        end
    end
    A
end

fill_array(9, 60)



On Tuesday, August 11, 2015 at 12:23:02 AM UTC+2, Júlio Hoffimann wrote:

 What am I doing wrong in the following code?

 function foo(N; parallel=false)
   if parallel && nprocs() < CPU_CORES
     addprocs(CPU_CORES - nprocs())
   end

   result = SharedArray(Float64, 9, N)
   @parallel for i=1:N
     sleep(1)
     result[:,i] = rand(3,3)[:]
   end

   result
 end

 If I call foo(60, parallel=true), result is all zeros. Expected behavior 
 is a random matrix.

 -Júlio



[julia-users] Embarrassingly parallel workload

2015-08-09 Thread Júlio Hoffimann
Hi,

Suppose I have a complicated but embarrassingly parallel loop, 
namely: 
https://github.com/juliohm/ImageQuilting.jl/blob/master/src/iqsim.jl#L167

How would you dispatch the iterations so that all cores in the client 
computer are busy working? Is there any construct in the language for this 
common scenario?

-Júlio


Re: [julia-users] Embarrassingly parallel workload

2015-08-09 Thread Tim Holy
Completely possible with SharedArrays, see 
http://docs.julialang.org/en/latest/manual/parallel-computing/#shared-arrays

--Tim

On Sunday, August 09, 2015 02:41:42 PM Júlio Hoffimann wrote:
 Hi,
 
 Suppose I have a complicated but embarrassingly parallel loop,
 namely:
 https://github.com/juliohm/ImageQuilting.jl/blob/master/src/iqsim.jl#L167
 
 How would you dispatch the iterations so that all cores in the client
 computer are busy working? Is there any construct in the language for this
 common scenario?
 
 -Júlio



Re: [julia-users] Embarrassingly parallel workload

2015-08-09 Thread Júlio Hoffimann
Consider the following simplified example. There is an algorithm
implemented as a function foo(N). This algorithm repeats the same recipe N
times in a loop to fill in an array of arrays:

function foo(N)
  # bunch of auxiliary variables goes here
  # ...

  result = []
  for i=1:N
    # complicated use of auxiliary variables to produce result[i]
    # ...

    push!(result, rand(3,3)) # push result[i] to the vector
  end

  return result
end

Each iteration is expensive and we would like to add an option foo(N;
parallel=false) to use all cores available. I would start with:

function foo(N; parallel=false)
  if parallel
    addprocs(CPU_CORES - 1)
  end

  # bunch of auxiliary variables goes here
  # ...

  result = SharedVector(???)
  @parallel for i=1:N
    # complicated use of auxiliary variables to produce result[i]
    # ...

    push!(result, rand(3,3))
  end

  return result
end

How do I initialize the SharedVector? Is this the correct approach? What
about the auxiliary variables: will they be copied to all workers? If so,
can I just replace Arrays with SharedArrays so that they are visible
everywhere without copying?

-Júlio
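
On the initialization question, one possibility in the spirit of the
SharedArray answers elsewhere in this thread: since each iteration yields a
3x3 matrix, preallocating a three-dimensional SharedArray avoids push!, which
is not safe to call from several processes at once. Plain auxiliary variables
referenced in the loop body are serialized and copied to each worker; only
arrays the workers must write back need to be shared. A minimal sketch under
those assumptions:

function foo(N; parallel=false)
  if parallel && nprocs() < CPU_CORES
    addprocs(CPU_CORES - nprocs())
  end

  result = SharedArray(Float64, (3, 3, N))   # one 3x3 slab per iteration
  @sync @parallel for i = 1:N
    # auxiliary variables used here are copied to each worker
    result[:, :, i] = rand(3, 3)
  end

  result
end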


Re: [julia-users] Embarrassingly parallel workload

2015-08-09 Thread Júlio Hoffimann
Thank you Tim, will check it carefully.

-Júlio