In another thread I mentioned looking at some largish data sets (tens of gigabytes in the form of a .csv file). I have made some progress treating the file as a memory-mapped Uint8 array, but that hasn't been as effective as I had hoped. Using a shared array and multiple processes seems like an effective way to parallelize the initial reduction of the .csv file.
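For reference, the memory-mapped approach is roughly this (a minimal sketch, not my exact code; the file name is just the test file mentioned below):

    # map the raw bytes of the file without reading it all into memory
    fname = "./kaggle/trainHistory.csv"
    io = open(fname, "r")
    mm = mmap_array(Uint8, (filesize(fname),), io)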
The best way I have come up with to get a file's contents into a shared array is

    sm = convert(SharedArray, open(readbytes, "./kaggle/trainHistory.csv"))

It would be convenient to process the contents on line boundaries. I can determine suitable ranges with something like

    function blocks(v::SharedVector{Uint8})
        np = length(v.pids)            # processes participating in the shared array
        len = length(v)
        bsz = div(len, np)             # nominal block size in bytes
        blks = Array(UnitRange{Int}, np)
        low = 1
        for i in 1:np-1
            # extend each block to the next newline so blocks end on line boundaries
            eolpos = findnext(v, '\n', i*bsz)
            blks[i] = UnitRange(low, eolpos)
            low = eolpos + 1
        end
        blks[np] = UnitRange(low, len)
        blks
    end

which in this case produces

    julia> blocks(sm)
    8-element Array{UnitRange{Int64},1}:
     1:794390
     794391:1588775
     1588776:2383151
     2383152:3177538
     3177539:3971942
     3971943:4766322
     4766323:5560686
     5560687:6355060

(This is a smaller file that I am using for testing; the real files are much larger.) These blocks will be different from what I would get with sm.loc_subarray_1d. It seems to me that I should be able to use these blocks rather than the .loc_subarray_1d blocks if I do enough juggling with @spawnat, fetch, etc. Is there anything that would stand in the way of doing so?
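For concreteness, here is a minimal sketch of the kind of juggling I have in mind. countlines_block is just a hypothetical stand-in for the real per-block reduction (it only counts newlines); the real work would parse the CSV fields instead:

    @everywhere function countlines_block(v::SharedVector{Uint8}, rng::UnitRange{Int})
        # stand-in for the real per-block reduction: count newline bytes in this range
        n = 0
        for i in rng
            if v[i] == 0x0a        # '\n'
                n += 1
            end
        end
        n
    end

    function reduce_blocks(v::SharedVector{Uint8})
        blks = blocks(v)
        # dispatch one block to each process participating in the shared array
        refs = [(@spawnat v.pids[i] countlines_block(v, blks[i])) for i in 1:length(v.pids)]
        # fetch the per-block results and combine them
        sum(map(fetch, refs))
    end

Calling reduce_blocks(sm) on the array above should then give the line count for the whole file, with each process touching only its own block of the shared bytes.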