In another thread I mentioned looking at some largish data sets (tens of 
gigabytes in the form of a .csv file).  I have made some progress treating 
the file as a memory-mapped Uint8 array, but that hasn't been as effective 
as I had hoped.  Using a shared array and multiple processes seems like an 
effective way to parallelize the initial reduction of the .csv file.
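
For reference, the memory-mapped version I had been experimenting with is 
roughly the sketch below (same test file as later in this message; mmap_array 
is the function in Base for mapping a file):

fn = "./kaggle/trainHistory.csv"
io = open(fn, "r")
mm = mmap_array(Uint8, (filesize(fn),), io)  # bytes of the file, mapped rather than read into memory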

The best way I have come up with for getting a file's contents into a shared 
array is

sm = convert(SharedArray, open(readbytes,"./kaggle/trainHistory.csv"))
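
(One point worth noting: the worker processes have to be added before this 
conversion, since the shared segment is mapped on the pids that exist when 
the SharedArray is constructed.  Something like

addprocs(7)   # illustrative count; use however many workers you have

run beforehand is what determines length(sm.pids), and hence the number of 
blocks computed below.)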

It would be convenient to process the contents on line boundaries.  I can 
determine suitable ranges with something like

function blocks(v::SharedVector{Uint8})
    np = length(v.pids)             # number of processes mapping the array
    len = length(v)
    bsz = div(len,np)               # nominal block size in bytes
    blks = Array(UnitRange{Int},np)
    low = 1
    for i in 1:np-1
        # push each nominal boundary forward to the next newline so blocks end on line boundaries
        eolpos = findnext(v, '\n', i*bsz)
        blks[i] = UnitRange(low, eolpos)
        low = eolpos + 1
    end
    blks[np] = UnitRange(low,len)   # the last block takes whatever remains
    blks
end

which in this case produces

julia> blocks(sm)
8-element Array{UnitRange{Int64},1}:
 1:794390       
 794391:1588775 
 1588776:2383151
 2383152:3177538
 3177539:3971942
 3971943:4766322
 4766323:5560686
 5560687:6355060


(This is a smaller file that I am using for testing.  The real files are 
much larger.)

These blocks will be different from the ranges in sm.loc_subarray_1d.  It 
seems to me that I should be able to work on these blocks rather than the 
.loc_subarray_1d blocks if I do enough juggling with @spawnat, fetch, etc.; a 
sketch of the kind of thing I have in mind is below.  Is there anything that 
would stand in the way of doing so?
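
Here is that sketch; lines_in_block is just a made-up placeholder for 
whatever per-block reduction a worker would actually do:

@everywhere function lines_in_block(v::SharedVector{Uint8}, rng::UnitRange{Int})
    n = 0
    for i in rng
        v[i] == 0x0a && (n += 1)   # 0x0a is '\n'; count the newlines in this block
    end
    n
end

blks = blocks(sm)
refs = [@spawnat(sm.pids[i], lines_in_block(sm, blks[i])) for i in 1:length(blks)]
total = sum(map(fetch, refs))      # combine the per-block results

Because each range ends at a newline, no record straddles two workers, which 
is the point of using these blocks instead of the .loc_subarray_1d ones.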
