I need to run something similar over a large number of text files that I
have. Each one is too large to load into memory on its own, let alone
several files at the same time. I find that pmap() works very well here.
First, you should wrap your for loop in a function. In general you should
organize your code into functions in Julia; loops at global scope are slow.
Second, can you pass an explicit delimiter to the split function?
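For example (count_fields is a name I'm making up; the point is that the loop lives inside a function and split gets an explicit delimiter):

```julia
# Wrapping the loop in a function lets the compiler specialize it;
# the same loop at global scope works on untyped globals and is slow.
function count_fields(path, delim)
    n = 0
    for line in eachline(path)
        # an explicit delimiter avoids the generic whitespace-splitting path
        n += length(split(line, delim))
    end
    return n
end

write("tmp.txt", "1 2 3\n4 5 6\n")  # tiny stand-in for the real file
count_fields("tmp.txt", ' ')
```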
Third, you may not need 32 procs for this job. There's overhead associated
with parallel processing, and for cheap per-item work the communication
cost can dominate.
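If you do keep pmap at the line level, batching amortizes that overhead: the batch_size keyword (which pmap accepts in recent Julia versions) groups many lines into a single remote call, so serialization is paid per batch rather than per line. A hedged sketch with made-up stand-in data:

```julia
using Distributed
addprocs(4)  # try a small worker count first; 32 may just add overhead

# Stand-in data: 100 tab-separated lines.
write("tmp.txt", join(("$i\t$(2i)" for i in 1:100), '\n'))

# batch_size=25 ships 25 lines per remote call instead of one.
results = pmap(split, eachline("tmp.txt"); batch_size=25)
```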
This Stack Overflow post has some more information that might be useful:
http://stackoverflow.com/questions/21890893/reading-csv-in-julia-is-slow-compared-to-python/35120894?noredirect=1#comment66827279_35120894
On Thursday, October 13, 2016 at 11:45:36 PM UTC-4, love...@gmail.com wrote:
>
> I want to process each line of a large text file (100G) in parallel using
> the following code
>
> pmap(process_fun, eachline(the_file))
>
> However, pmap seems to be slow. The following is a dummy experiment:
>
> julia> writedlm("tmp.txt",rand(10,100)) # produce a large file
> julia> @time for l in eachline("tmp.txt")
> split(l)
> end
> 5.678517 seconds (11.00 M allocations: 732.637 MB, 40.67% gc time)
>
> julia> addprocs() # 32 core
>
> julia> @time map(split, eachline("tmp.txt"));
> 4.834571 seconds (11.00 M allocations: 734.638 MB, 32.84% gc time)
>
> julia> @time pmap(split, eachline("tmp.txt"));
> 112.275411 seconds (227.06 M allocations: 10.024 GB, 50.72% gc time)
>
> The goal is to process those files (300+) as fast as possible. Maybe
> there are better ways to call pmap?
>