I can do it using pmap, but it doesn't run any faster (still 3 workers):
julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
 186.704029 seconds (410.30 k allocations: 29.503 MB, 0.01% gc time)

That was using this code:
"""
Return a function that takes one argument, the optimization parameters.
It evaluates the -LL in parallel
"""
function makeParallelLL()
    np = nworkers()
    cuts = chunks(obs[:id], np)
    function myLL(x)
        args = [(cuts[i, 1], cuts[i,2], x) for i = 1:np]
        ll= pmap(x->cutLL(x...), args)
        -sum(ll)
    end
end
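
A quick way to see whether the time is dominated by the likelihood itself or by per-call parallel overhead is to time a single evaluation each way (a sketch reusing the names above; Calculus.hessian makes many such calls in sequence, so per-call pmap overhead is paid on every one):

x0 = RB.troubleParamsSim1.raw
@time RB.mylike(x0)   # one serial -LL evaluation
@time myf(x0)         # one parallel -LL evaluation (one pmap round trip)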

________________________________
From: julia-users@googlegroups.com [julia-users@googlegroups.com] on behalf of Christopher Fisher [fishe...@miamioh.edu]
Sent: Monday, June 20, 2016 4:45 AM
To: julia-users
Subject: [julia-users] Re: performance of parallel code

Would using pmap() be suitable for your application? Usually when I use it I get slightly more than a 2X speedup with 4 cores.
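
For reference, the basic pmap pattern looks like this (a generic sketch; heavy() is a made-up stand-in for an expensive function). pmap ships each element to a free worker and collects the results, so it pays off when each call is expensive relative to serializing its arguments:

addprocs(4)                           # start 4 worker processes
@everywhere heavy(n) = sum(sin, 1:n)  # hypothetical expensive function
results = pmap(heavy, fill(10^7, 8))  # one task per element, spread over workers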


On Monday, June 20, 2016 at 2:27:41 AM UTC-4, Boylan, Ross wrote:
I think I've taken steps to minimize parallel overhead by providing only one function call per process and passing really minimal arguments to the functions. But the gains in speed don't seem commensurate with the number of processors. I know that pure linear speedup is too much to hope for, but I suspect I'm doing something wrong--for example, that large data is getting passed around despite my efforts.
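
One direct way to test the "large data is getting passed around" suspicion is to time an explicit round trip of the data to a worker (a sketch; note that 0.4-era remotecall_fetch takes the worker id as its first argument):

# If this is slow, shipping obs dominates; if it is fast, look elsewhere.
@time remotecall_fetch(2, size, RB.obs)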

All my code is defined inside a module, though I exercise it from the main REPL.

Single processor (times are representative of multiple samples and exclude 
burn-in runs):
julia> @time h=Calculus.hessian(RB.mylike, RB.troubleParamsSim1.raw)
 206.422562 seconds (2.43 G allocations: 83.867 GB, 10.14% gc time)

#with 3 workers
julia> myf = RB.makeParallelLL()  # code below
julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
 182.567647 seconds (1.48 M allocations: 111.622 MB, 0.02% gc time)

#with 7 workers
julia> @time h10 = Calculus.hessian(myf, RB.troubleParamsSim1.raw)
  82.453033 seconds (3.43 M allocations: 259.838 MB, 0.08% gc time)

Any suggestions? This is on an 8-CPU VM; the underlying hardware has >>8 processors.

Here's some of the code:

mylike(v) = -logitlite(RB.obs, v, 7)

"""
return a matrix whose columns are the start and the end of range of 
observations, inclusive.
these chunks are roughly equal sizes, and no id is  split between chunks.
n  target number of chunks, and thus of rows returned
id id for each row; generally one id may appear in several rows.
   all rows with the same id must be contiguous.
"""
function chunks(id, n::Int)
    nTot = length(id)
    s = nTot/n
    cuts = Vector{Int}(n)
    cuts[1] = 1
    for i = 1:(n-1)
        cut::Int = floor(s*i)
        while id[cut] == id[cut-1]
            cut += 1
        end
        cuts[i+1] = cut
    end
    stops = vcat(cuts[2:n]-1, nTot)
    hcat(cuts, stops)
end
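
As a sanity check, on a toy id vector (made up here) the function splits 7 rows into 2 chunks without splitting id 2 across them:

julia> chunks([1,1,2,2,2,3,3], 2)
2x2 Array{Int64,2}:
 1  2
 3  7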

"""
Parallel helper: evaluate likelihood over limited range
i0..i1, inclusive, is the range.
This is an effort to avoid passing obs between processes.
"""
function cutLL(i0::Int, i1::Int, x)
    logitlite(obs[i0:i1,:], x, 7)  #obs is a 6413x8 data.frame.  Maybe this is 
doing an extra copy?
end
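
On the extra-copy question: range indexing on a DataFrame does materialize a new table for each slice. If the installed DataFrames version provides it, a SubDataFrame view avoids the copy (a sketch; assumes logitlite accepts an AbstractDataFrame):

function cutLL(i0::Int, i1::Int, x)
    logitlite(sub(obs, i0:i1), x, 7)  # view into obs; no row copy
end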

"""
Return a function that takes one argument, the optimization parameters.
It evaluates the -LL in parallel
"""
function makeParallelLL()
    np = nworkers()
    cuts = chunks(obs[:id], np)
    function myLL(x)
       ll= @parallel (+) for i=1:np
            cutLL(cuts[i,1], cuts[i,2], x)
        end
        -ll
    end
end
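
For readers unfamiliar with the construct, @parallel with a reducer runs the loop body on the workers and folds the partial results on the master (a generic sketch, unrelated to the likelihood code):

total = @parallel (+) for i = 1:1000
    i^2   # each worker sums its share; (+) combines the per-worker sums
end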

It might be relevant that logitlite uses a by() to process groups of 
observations.
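
For context, by() splits a DataFrame on a grouping column, applies a function to each group, and stacks the results; it allocates a sub-frame per group, which may be part of the heavy allocation count in the serial timing. A generic sketch with made-up data:

using DataFrames
df = DataFrame(id = [1,1,2,2,2], y = [0.1, 0.4, 0.2, 0.9, 0.3])
by(df, :id) do g
    DataFrame(total = sum(g[:y]))   # one result row per id group
end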

So I think the only things I'm passing to the workers are a function call (cutLL), 2 integers, and a 10-element Vector{Float64}, and the only thing coming back is 1 Float64.

Or perhaps the fact that I'm using an inner function definition (myLL in the last function definition) is doing something like causing transmission of all variables in lexical scope?
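
One way to test that worry (a sketch with made-up sizes, described for 0.4-era Julia, where a shipped closure's globals are resolved on the receiving worker): a closure that captures a large local variable ships that data with every call, while one that only references a global already defined on each worker ships just the small closure. Since myLL only closes over np and the small cuts matrix, and obs is a module global resolved on the workers, the closure itself should be cheap to ship.

function capture_demo()
    big = rand(10^6)               # local array, captured by the closure below
    f_capture = i -> i + sum(big)  # `big` is serialized along with f_capture
    @time pmap(f_capture, 1:8)     # pays the shipping cost for `big`
end

@everywhere gbig = rand(10^6)      # defined locally on every worker
f_global = i -> i + sum(gbig)      # workers resolve gbig in their own Main
@time pmap(f_global, 1:8)          # only the small closure is shipped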
