Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

2015-08-25 Thread Andrei Zh
I submitted a new issue https://github.com/JuliaWeb/Requests.jl/issues/61 
to Requests.jl describing the problem and my observations. In short, each 
separate request takes a reasonable time, but when I launch a lot of them 
in tasks, they become very slow. 
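
Roughly, this is the kind of measurement behind those observations (a sketch; it
assumes Requests.jl is loaded and that urls holds the URL strings, with each
request timed inside its own task so the growing latency is visible):

using Requests

@time @sync for url in urls
    @async begin
        t = @elapsed resp = get(url)
        println("Status: $(resp.status) ($(t) sec)")
    end
end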


On Tuesday, August 25, 2015 at 2:44:30 AM UTC+3, Andrei Zh wrote:

 @Jameson: setting UV_THREADPOOL_SIZE to 32 seems to roughly halve DNS resolution 
 time (from 26 to 12 seconds in my latest tests), so thank you. 

 However, DNS doesn't seem to be the only root of the problem: I noticed that 
 with a large number of URLs the first ones return results very quickly (300-800 ms), 
 but then latency begins to grow until the timeout is exceeded (I've seen more 
 than 100 seconds for some requests). 

 I will try to set up a stable test and post it here as well as to the 
 Requests.jl issue tracker. 


 On Monday, August 24, 2015 at 8:21:51 PM UTC+3, Jameson wrote:

 If you are doing a lot of parallel DNS queries, you may want to try 
 increasing the number that can run simultaneously by setting the 
 UV_THREADPOOL_SIZE environment variable to something larger before starting 
 Julia (the default is 4, the max is 128).

 On Mon, Aug 24, 2015 at 9:17 AM Andrei Zh faithle...@gmail.com wrote:

 Jonathan, thanks for your support. So far I've noticed that DNS introduces a 
 pretty large delay. E.g. resolving IP addresses for 1000 URLs took 80 
 seconds in serial code and 26 seconds in multi-task code: 


 Serial execution: 

 julia> @time for url in urls
            begin
                Base.getaddrinfo(URI(url).host)
            end
        end
 elapsed time: 80.071810293 seconds (732400 bytes allocated)


 Multitask execution:

 julia> @time @sync for url in urls
            @async begin
                Base.getaddrinfo(URI(url).host)
            end
        end

 elapsed time: 26.241893516 seconds (4277968 bytes allocated)

 So I'll try to pre-resolve IPs and test again. 


 On Monday, August 24, 2015 at 4:01:44 PM UTC+3, Jonathan Malmaud wrote:

 As one of the maintainers of Requests.jl, I'm especially interested in 
 its use for high-performance applications so don't hesitate to file an 
 issue if it gives you any performance problems.

 On Sunday, August 23, 2015 at 7:40:08 PM UTC-4, Andrei Zh wrote:

 Hi Steven, 

 thanks for your answer! It turns out I misunderstood @async a long time 
 ago, assuming it also makes a remote call to other processes and thus 
 introduces true multi-tasking. So now I need to rethink my approach before 
 going further. 

 Just to clarify: my goal is to perform as many requests as possible at 
 the same time, so I want to use both: multiple processes (to run several 
 requests on several cores in parallel) and tasks (to launch new requests 
 while old ones are still waiting for I/O to complete). 

 So I will update my approach and come back with results or new 
 questions. 



 On Monday, August 24, 2015 at 2:13:23 AM UTC+3, Steven G. Johnson 
 wrote:

 @parallel in Julia is for executing separate parallel processes (true 
 multi-tasking, with separate address spaces).  @async is for tasks, 
 which 
 are green threads and represent cooperative multitasking (within the 
 same 
 process and the same address space).

 I/O in Julia is asynchronous — while one task is blocked waiting on 
 I/O, another task will wake up and start running.  (This is based on the 
 libuv library, which is designed for high-performance asynchronous I/O.)

 The first question is whether you want to fetch URLs in separate OS 
 processes, or you want to use green threads within the same process.  It 
 sounds like you want the latter, in which case @async is the right thing.

 The second question is whether something about the Requests.jl 
 package is serializing things somehow; for that you might file an issue 
 at 
 Requests.jl.



[julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

2015-08-24 Thread Andrei Zh
Jonathan, thanks for your support. So far I've noticed that DNS introduces a pretty 
large delay. E.g. resolving IP addresses for 1000 URLs took 80 seconds in 
serial code and 26 seconds in multi-task code: 


Serial execution: 

julia> @time for url in urls
           begin
               Base.getaddrinfo(URI(url).host)
           end
       end
elapsed time: 80.071810293 seconds (732400 bytes allocated)


Multitask execution:

julia> @time @sync for url in urls
           # sleep(0.01)
           @async begin
               # t = @elapsed resp = get(url)
               # println("Status: $(resp.status) ($(t) sec)")
               Base.getaddrinfo(URI(url).host)
           end
       end

elapsed time: 26.241893516 seconds (4277968 bytes allocated)

So I'll try to pre-resolve IPs and test again. 
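
Something like this, probably (a rough sketch, assuming URIParser for URI and
that urls holds the URL strings): resolve each distinct host once, concurrently,
and keep the results in a Dict for later use.

using URIParser

hosts = unique([URI(url).host for url in urls])
ips = Dict()    # host => resolved address
@sync for host in hosts
    @async begin
        ips[host] = Base.getaddrinfo(host)
    end
end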





On Monday, August 24, 2015 at 4:01:44 PM UTC+3, Jonathan Malmaud wrote:

 As one of the maintainers of Requests.jl, I'm especially interested in its 
 use for high-performance applications so don't hesitate to file an issue if 
 it gives you any performance problems.

 On Sunday, August 23, 2015 at 7:40:08 PM UTC-4, Andrei Zh wrote:

 Hi Steven, 

 thanks for your answer! It turns out I misunderstood @async a long time 
 ago, assuming it also makes a remote call to other processes and thus 
 introduces true multi-tasking. So now I need to rethink my approach before 
 going further. 

 Just to clarify: my goal is to perform as many requests as possible at 
 the same time, so I want to use both: multiple processes (to run several 
 requests on several cores in parallel) and tasks (to launch new requests 
 while old ones are still waiting for I/O to complete). 

 So I will update my approach and come back with results or new questions. 



 On Monday, August 24, 2015 at 2:13:23 AM UTC+3, Steven G. Johnson wrote:

 @parallel in Julia is for executing separate parallel processes (true 
 multi-tasking, with separate address spaces).  @async is for tasks, which 
 are green threads and represent cooperative multitasking (within the same 
 process and the same address space).

 I/O in Julia is asynchronous — while one task is blocked waiting on I/O, 
 another task will wake up and start running.  (This is based on the libuv 
 library, which is designed for high-performance asynchronous I/O.)

 The first question is whether you want to fetch URLs in separate OS 
 processes, or you want to use green threads within the same process.  It 
 sounds like you want the latter, in which case @async is the right thing.

 The second question is whether something about the Requests.jl package 
 is serializing things somehow; for that you might file an issue at 
 Requests.jl.



[julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

2015-08-24 Thread Jonathan Malmaud
As one of the maintainers of Requests.jl, I'm especially interested in its 
use for high-performance applications so don't hesitate to file an issue if 
it gives you any performance problems.

On Sunday, August 23, 2015 at 7:40:08 PM UTC-4, Andrei Zh wrote:

 Hi Steven, 

 thanks for your answer! It turns out I misunderstood @async a long time ago, 
 assuming it also makes a remote call to other processes and thus introduces 
 true multi-tasking. So now I need to rethink my approach before going 
 further. 

 Just to clarify: my goal is to perform as many requests as possible at the 
 same time, so I want to use both: multiple processes (to run several 
 requests on several cores in parallel) and tasks (to launch new requests 
 while old ones are still waiting for I/O to complete). 

 So I will update my approach and come back with results or new questions. 



 On Monday, August 24, 2015 at 2:13:23 AM UTC+3, Steven G. Johnson wrote:

 @parallel in Julia is for executing separate parallel processes (true 
 multi-tasking, with separate address spaces).  @async is for tasks, which 
 are green threads and represent cooperative multitasking (within the same 
 process and the same address space).

 I/O in Julia is asynchronous — while one task is blocked waiting on I/O, 
 another task will wake up and start running.  (This is based on the libuv 
 library, which is designed for high-performance asynchronous I/O.)

 The first question is whether you want to fetch URLs in separate OS 
 processes, or you want to use green threads within the same process.  It 
 sounds like you want the latter, in which case @async is the right thing.

 The second question is whether something about the Requests.jl package is 
 serializing things somehow; for that you might file an issue at Requests.jl.



Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

2015-08-24 Thread Jonathan Malmaud
Thanks for the report, Andrei - Would you mind filing this as an issue at
https://github.com/JuliaWeb/Requests.jl?

On Mon, Aug 24, 2015 at 9:17 AM, Andrei Zh faithlessfri...@gmail.com
wrote:

 Jonathan, thanks for your support. So far I've noticed that DNS introduces a pretty
 large delay. E.g. resolving IP addresses for 1000 URLs took 80 seconds in
 serial code and 26 seconds in multi-task code:


 Serial execution:

 julia> @time for url in urls
            begin
                Base.getaddrinfo(URI(url).host)
            end
        end
 elapsed time: 80.071810293 seconds (732400 bytes allocated)


 Multitask execution:

 julia> @time @sync for url in urls
            @async begin
                Base.getaddrinfo(URI(url).host)
            end
        end

 elapsed time: 26.241893516 seconds (4277968 bytes allocated)

 So I'll try to pre-resolve IPs and test again.


 On Monday, August 24, 2015 at 4:01:44 PM UTC+3, Jonathan Malmaud wrote:

 As one of the maintainers of Requests.jl, I'm especially interested in
 its use for high-performance applications so don't hesitate to file an
 issue if it gives you any performance problems.

 On Sunday, August 23, 2015 at 7:40:08 PM UTC-4, Andrei Zh wrote:

 Hi Steven,

 thanks for your answer! It turns out I misunderstood @async a long time
 ago, assuming it also makes a remote call to other processes and thus
 introduces true multi-tasking. So now I need to rethink my approach before
 going further.

 Just to clarify: my goal is to perform as many requests as possible at
 the same time, so I want to use both: multiple processes (to run several
 requests on several cores in parallel) and tasks (to launch new requests
 while old ones are still waiting for I/O to complete).

 So I will update my approach and come back with results or new
 questions.



 On Monday, August 24, 2015 at 2:13:23 AM UTC+3, Steven G. Johnson wrote:

 @parallel in Julia is for executing separate parallel processes (true
 multi-tasking, with separate address spaces).  @async is for tasks, which
 are green threads and represent cooperative multitasking (within the same
 process and the same address space).

 I/O in Julia is asynchronous — while one task is blocked waiting on
 I/O, another task will wake up and start running.  (This is based on the
 libuv library, which is designed for high-performance asynchronous I/O.)

 The first question is whether you want to fetch URLs in separate OS
 processes, or you want to use green threads within the same process.  It
 sounds like you want the latter, in which case @async is the right thing.

 The second question is whether something about the Requests.jl package
 is serializing things somehow; for that you might file an issue at
 Requests.jl.




Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

2015-08-24 Thread Seth
Name resolution delays are generally an issue with network latency. Trying 
to resolve 1000 uncached names will take a while on any system:

seth@schroeder:~$ time host www.julialang.org
www.julialang.org is an alias for julialang.github.io.
julialang.github.io is an alias for github.map.fastly.net.
github.map.fastly.net has address 23.235.47.133


real 0m3.268s
user 0m0.003s
sys 0m0.020s

Unless your server is localhost, you're going to have delays.

On Monday, August 24, 2015 at 6:49:47 AM UTC-7, Jonathan Malmaud wrote:

 Thanks for the report, Andrei - Would you mind filing this as an issue at 
 https://github.com/JuliaWeb/Requests.jl? 

 On Mon, Aug 24, 2015 at 9:17 AM, Andrei Zh faithle...@gmail.com wrote:

 Jonathan, thanks for your support. So far I've noticed that DNS introduces a pretty 
 large delay. E.g. resolving IP addresses for 1000 URLs took 80 seconds in 
 serial code and 26 seconds in multi-task code: 


 Serial execution: 

 julia> @time for url in urls
            begin
                Base.getaddrinfo(URI(url).host)
            end
        end
 elapsed time: 80.071810293 seconds (732400 bytes allocated)


 Multitask execution:

 julia> @time @sync for url in urls
            @async begin
                Base.getaddrinfo(URI(url).host)
            end
        end

 elapsed time: 26.241893516 seconds (4277968 bytes allocated)

 So I'll try to pre-resolve IPs and test again. 


 On Monday, August 24, 2015 at 4:01:44 PM UTC+3, Jonathan Malmaud wrote:

 As one of the maintainers of Requests.jl, I'm especially interested in 
 its use for high-performance applications so don't hesitate to file an 
 issue if it gives you any performance problems.

 On Sunday, August 23, 2015 at 7:40:08 PM UTC-4, Andrei Zh wrote:

 Hi Steven, 

 thanks for your answer! It turns out I misunderstood @async a long time 
 ago, assuming it also makes a remote call to other processes and thus 
 introduces true multi-tasking. So now I need to rethink my approach before 
 going further. 

 Just to clarify: my goal is to perform as many requests as possible at 
 the same time, so I want to use both: multiple processes (to run several 
 requests on several cores in parallel) and tasks (to launch new requests 
 while old ones are still waiting for I/O to complete). 

 So I will update my approach and come back with results or new 
 questions. 



 On Monday, August 24, 2015 at 2:13:23 AM UTC+3, Steven G. Johnson wrote:

 @parallel in Julia is for executing separate parallel processes (true 
 multi-tasking, with separate address spaces).  @async is for tasks, 
 which 
 are green threads and represent cooperative multitasking (within the 
 same 
 process and the same address space).

 I/O in Julia is asynchronous — while one task is blocked waiting on 
 I/O, another task will wake up and start running.  (This is based on the 
 libuv library, which is designed for high-performance asynchronous I/O.)

 The first question is whether you want to fetch URLs in separate OS 
 processes, or you want to use green threads within the same process.  It 
 sounds like you want the latter, in which case @async is the right thing.

 The second question is whether something about the Requests.jl package 
 is serializing things somehow; for that you might file an issue at 
 Requests.jl.




Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

2015-08-24 Thread Uwe Fechner
Did you try Google's DNS servers, e.g. 8.8.8.8?
I've never seen a reply take more than one second.
(Well, on our university network.)

Am Montag, 24. August 2015 16:25:06 UTC+2 schrieb Seth:

 Name resolution delays are generally an issue with network latency. Trying 
 to resolve 1000 uncached names will take a while on any system:

 seth@schroeder:~$ time host www.julialang.org
 www.julialang.org is an alias for julialang.github.io.
 julialang.github.io is an alias for github.map.fastly.net.
 github.map.fastly.net has address 23.235.47.133


 real 0m3.268s
 user 0m0.003s
 sys 0m0.020s

 Unless your server is localhost, you're going to have delays.

 On Monday, August 24, 2015 at 6:49:47 AM UTC-7, Jonathan Malmaud wrote:

 Thanks for the report, Andrei - Would you mind filing this as an issue at 
 https://github.com/JuliaWeb/Requests.jl? 

 On Mon, Aug 24, 2015 at 9:17 AM, Andrei Zh faithle...@gmail.com wrote:

 Jonathan, thanks for your support. So far I've noticed that DNS introduces a 
 pretty large delay. E.g. resolving IP addresses for 1000 URLs took 80 
 seconds in serial code and 26 seconds in multi-task code: 


 Serial execution: 

 julia> @time for url in urls
            begin
                Base.getaddrinfo(URI(url).host)
            end
        end
 elapsed time: 80.071810293 seconds (732400 bytes allocated)


 Multitask execution:

 julia> @time @sync for url in urls
            @async begin
                Base.getaddrinfo(URI(url).host)
            end
        end

 elapsed time: 26.241893516 seconds (4277968 bytes allocated)

 So I'll try to pre-resolve IPs and test again. 


 On Monday, August 24, 2015 at 4:01:44 PM UTC+3, Jonathan Malmaud wrote:

 As one of the maintainers of Requests.jl, I'm especially interested in 
 its use for high-performance applications so don't hesitate to file an 
 issue if it gives you any performance problems.

 On Sunday, August 23, 2015 at 7:40:08 PM UTC-4, Andrei Zh wrote:

 Hi Steven, 

 thanks for your answer! It turns out I misunderstood @async a long time 
 ago, assuming it also makes a remote call to other processes and thus 
 introduces true multi-tasking. So now I need to rethink my approach 
 before going further. 

 Just to clarify: my goal is to perform as many requests as possible at 
 the same time, so I want to use both: multiple processes (to run several 
 requests on several cores in parallel) and tasks (to launch new requests 
 while old ones are still waiting for I/O to complete). 

 So I will update my approach and come back with results or new 
 questions. 



 On Monday, August 24, 2015 at 2:13:23 AM UTC+3, Steven G. Johnson 
 wrote:

 @parallel in Julia is for executing separate parallel processes (true 
 multi-tasking, with separate address spaces).  @async is for tasks, 
 which 
 are green threads and represent cooperative multitasking (within the 
 same 
 process and the same address space).

 I/O in Julia is asynchronous — while one task is blocked waiting on 
 I/O, another task will wake up and start running.  (This is based on the 
 libuv library, which is designed for high-performance asynchronous I/O.)

 The first question is whether you want to fetch URLs in separate OS 
 processes, or you want to use green threads within the same process.  It 
 sounds like you want the latter, in which case @async is the right thing.

 The second question is whether something about the Requests.jl 
 package is serializing things somehow; for that you might file an issue 
 at 
 Requests.jl.




Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

2015-08-24 Thread Seth
Uni networks are typically high bandwidth and low latency and are optimally 
peered. My home network, unfortunately, exhibits none of those traits :)

In any case, even a 70-millisecond resolution time adds up to 70 seconds of 
delay for 1000 uncached resolutions done one after another.

On Monday, August 24, 2015 at 7:42:16 AM UTC-7, Uwe Fechner wrote:

 Did you try Google's DNS servers, e.g. 8.8.8.8?
 I've never seen a reply take more than one second.
 (Well, on our university network.)

 Am Montag, 24. August 2015 16:25:06 UTC+2 schrieb Seth:

 Name resolution delays are generally an issue with network latency. 
 Trying to resolve 1000 uncached names will take a while on any system:

 seth@schroeder:~$ time host www.julialang.org
 www.julialang.org is an alias for julialang.github.io.
 julialang.github.io is an alias for github.map.fastly.net.
 github.map.fastly.net has address 23.235.47.133


 real 0m3.268s
 user 0m0.003s
 sys 0m0.020s

 Unless your server is localhost, you're going to have delays.

 On Monday, August 24, 2015 at 6:49:47 AM UTC-7, Jonathan Malmaud wrote:

 Thanks for the report, Andrei - Would you mind filing this as an issue 
 at https://github.com/JuliaWeb/Requests.jl? 

 On Mon, Aug 24, 2015 at 9:17 AM, Andrei Zh faithle...@gmail.com wrote:

 Jonathan, thanks for your support. So far I've noticed that DNS introduces a 
 pretty large delay. E.g. resolving IP addresses for 1000 URLs took 80 
 seconds in serial code and 26 seconds in multi-task code: 


 Serial execution: 

 julia> @time for url in urls
            begin
                Base.getaddrinfo(URI(url).host)
            end
        end
 elapsed time: 80.071810293 seconds (732400 bytes allocated)


 Multitask execution:

 julia> @time @sync for url in urls
            @async begin
                Base.getaddrinfo(URI(url).host)
            end
        end

 elapsed time: 26.241893516 seconds (4277968 bytes allocated)

 So I'll try to pre-resolve IPs and test again. 


 On Monday, August 24, 2015 at 4:01:44 PM UTC+3, Jonathan Malmaud wrote:

 As one of the maintainers of Requests.jl, I'm especially interested in 
 its use for high-performance applications so don't hesitate to file an 
 issue if it gives you any performance problems.

 On Sunday, August 23, 2015 at 7:40:08 PM UTC-4, Andrei Zh wrote:

 Hi Steven, 

 thanks for your answer! It turns out I misunderstood @async a long time 
 ago, assuming it also makes a remote call to other processes and thus 
 introduces true multi-tasking. So now I need to rethink my approach 
 before going further. 

 Just to clarify: my goal is to perform as many requests as possible 
 at the same time, so I want to use both: multiple processes (to run 
 several requests on several cores in parallel) and tasks (to launch new 
 requests while old ones are still waiting for I/O to complete). 

 So I will update my approach and come back with results or new 
 questions. 



 On Monday, August 24, 2015 at 2:13:23 AM UTC+3, Steven G. Johnson 
 wrote:

 @parallel in Julia is for executing separate parallel processes 
 (true multi-tasking, with separate address spaces).  @async is for 
 tasks, 
 which are green threads and represent cooperative multitasking 
 (within 
 the same process and the same address space).

 I/O in Julia is asynchronous — while one task is blocked waiting on 
 I/O, another task will wake up and start running.  (This is based on 
 the 
 libuv library, which is designed for high-performance asynchronous I/O.)

 The first question is whether you want to fetch URLs in separate OS 
 processes, or you want to use green threads within the same process.  
 It 
 sounds like you want the latter, in which case @async is the right 
 thing.

 The second question is whether something about the Requests.jl 
 package is serializing things somehow; for that you might file an issue 
 at 
 Requests.jl.




Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

2015-08-24 Thread Jameson Nash
If you are doing a lot of parallel DNS queries, you may want to try
increasing the number that can run simultaneously by setting the
UV_THREADPOOL_SIZE environment variable to something larger before starting
Julia (the default is 4, the max is 128).
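
For example, in a bash-like shell (my_scanner.jl is just a placeholder for
whatever script you run; the point is that the variable is in the environment
before Julia starts):

$ UV_THREADPOOL_SIZE=32 julia my_scanner.jl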

On Mon, Aug 24, 2015 at 9:17 AM Andrei Zh faithlessfri...@gmail.com wrote:

 Jonathan, thanks for your support. So far I've noticed that DNS introduces a pretty
 large delay. E.g. resolving IP addresses for 1000 URLs took 80 seconds in
 serial code and 26 seconds in multi-task code:


 Serial execution:

 julia> @time for url in urls
            begin
                Base.getaddrinfo(URI(url).host)
            end
        end
 elapsed time: 80.071810293 seconds (732400 bytes allocated)


 Multitask execution:

 julia> @time @sync for url in urls
            @async begin
                Base.getaddrinfo(URI(url).host)
            end
        end

 elapsed time: 26.241893516 seconds (4277968 bytes allocated)

 So I'll try to pre-resolve IPs and test again.


 On Monday, August 24, 2015 at 4:01:44 PM UTC+3, Jonathan Malmaud wrote:

 As one of the maintainers of Requests.jl, I'm especially interested in
 its use for high-performance applications so don't hesitate to file an
 issue if it gives you any performance problems.

 On Sunday, August 23, 2015 at 7:40:08 PM UTC-4, Andrei Zh wrote:

 Hi Steven,

 thanks for your answer! It turns out I misunderstood @async a long time
 ago, assuming it also makes a remote call to other processes and thus
 introduces true multi-tasking. So now I need to rethink my approach before
 going further.

 Just to clarify: my goal is to perform as many requests as possible at
 the same time, so I want to use both: multiple processes (to run several
 requests on several cores in parallel) and tasks (to launch new requests
 while old ones are still waiting for I/O to complete).

 So I will update my approach and come back with results or new
 questions.



 On Monday, August 24, 2015 at 2:13:23 AM UTC+3, Steven G. Johnson wrote:

 @parallel in Julia is for executing separate parallel processes (true
 multi-tasking, with separate address spaces).  @async is for tasks, which
 are green threads and represent cooperative multitasking (within the same
 process and the same address space).

 I/O in Julia is asynchronous — while one task is blocked waiting on
 I/O, another task will wake up and start running.  (This is based on the
 libuv library, which is designed for high-performance asynchronous I/O.)

 The first question is whether you want to fetch URLs in separate OS
 processes, or you want to use green threads within the same process.  It
 sounds like you want the latter, in which case @async is the right thing.

 The second question is whether something about the Requests.jl package
 is serializing things somehow; for that you might file an issue at
 Requests.jl.




Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

2015-08-24 Thread Andrei Zh
@Jameson: setting UV_THREADPOOL_SIZE to 32 seems to roughly halve DNS resolution 
time (from 26 to 12 seconds in my latest tests), so thank you. 

However, DNS doesn't seem to be the only root of the problem: I noticed that 
with a large number of URLs the first ones return results very quickly (300-800 ms), 
but then latency begins to grow until the timeout is exceeded (I've seen more 
than 100 seconds for some requests). 

I will try to set up a stable test and post it here as well as to the 
Requests.jl issue tracker. 
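
Something along these lines, probably (a rough sketch assuming Requests.jl and
a urls vector; the idea is just to record every request's latency so the growth
over time is visible):

using Requests

latencies = zeros(length(urls))
@sync for (i, url) in enumerate(urls)
    @async begin
        latencies[i] = @elapsed get(url)
    end
end
println("first 10: ", latencies[1:10])
println("last 10:  ", latencies[end-9:end])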


On Monday, August 24, 2015 at 8:21:51 PM UTC+3, Jameson wrote:

 If you are doing a lot of parallel DNS queries, you may want to try 
 increasing the number that can run simultaneously by setting the 
 UV_THREADPOOL_SIZE environment variable to something larger before starting 
 Julia (the default is 4, the max is 128).

 On Mon, Aug 24, 2015 at 9:17 AM Andrei Zh faithle...@gmail.com wrote:

 Jonathan, thanks for your support. So far I've noticed that DNS introduces a pretty 
 large delay. E.g. resolving IP addresses for 1000 URLs took 80 seconds in 
 serial code and 26 seconds in multi-task code: 


 Serial execution: 

 julia> @time for url in urls
            begin
                Base.getaddrinfo(URI(url).host)
            end
        end
 elapsed time: 80.071810293 seconds (732400 bytes allocated)


 Multitask execution:

 julia> @time @sync for url in urls
            @async begin
                Base.getaddrinfo(URI(url).host)
            end
        end

 elapsed time: 26.241893516 seconds (4277968 bytes allocated)

 So I'll try to pre-resolve IPs and test again. 


 On Monday, August 24, 2015 at 4:01:44 PM UTC+3, Jonathan Malmaud wrote:

 As one of the maintainers of Requests.jl, I'm especially interested in 
 its use for high-performance applications so don't hesitate to file an 
 issue if it gives you any performance problems.

 On Sunday, August 23, 2015 at 7:40:08 PM UTC-4, Andrei Zh wrote:

 Hi Steven, 

 thanks for your answer! It turns out I misunderstood @async a long time 
 ago, assuming it also makes a remote call to other processes and thus 
 introduces true multi-tasking. So now I need to rethink my approach before 
 going further. 

 Just to clarify: my goal is to perform as many requests as possible at 
 the same time, so I want to use both: multiple processes (to run several 
 requests on several cores in parallel) and tasks (to launch new requests 
 while old ones are still waiting for I/O to complete). 

 So I will update my approach and come back with results or new 
 questions. 



 On Monday, August 24, 2015 at 2:13:23 AM UTC+3, Steven G. Johnson wrote:

 @parallel in Julia is for executing separate parallel processes (true 
 multi-tasking, with separate address spaces).  @async is for tasks, 
 which 
 are green threads and represent cooperative multitasking (within the 
 same 
 process and the same address space).

 I/O in Julia is asynchronous — while one task is blocked waiting on 
 I/O, another task will wake up and start running.  (This is based on the 
 libuv library, which is designed for high-performance asynchronous I/O.)

 The first question is whether you want to fetch URLs in separate OS 
 processes, or you want to use green threads within the same process.  It 
 sounds like you want the latter, in which case @async is the right thing.

 The second question is whether something about the Requests.jl package 
 is serializing things somehow; for that you might file an issue at 
 Requests.jl.



[julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

2015-08-23 Thread Andrei Zh
More generally, is there anything like futures/callbacks or async IO in 
Julia? I couldn't find anything, but maybe I just don't know the right 
keywords. 


[julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

2015-08-23 Thread Andrei Zh
Hi Steven, 

thanks for your answer! It turns out I misunderstood @async a long time ago, 
assuming it also makes a remote call to other processes and thus introduces 
true multi-tasking. So now I need to rethink my approach before going 
further. 

Just to clarify: my goal is to perform as many requests as possible at the 
same time, so I want to use both: multiple processes (to run several 
requests on several cores in parallel) and tasks (to launch new requests 
while old ones are still waiting for I/O to complete). 
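
A rough sketch of what I have in mind (assuming Requests.jl is available on all
workers and urls is a vector of URL strings; the chunking and the names are
just illustrative):

addprocs(3)                       # extra worker processes
@everywhere using Requests

@everywhere function fetch_chunk(chunk)
    statuses = zeros(Int, length(chunk))
    @sync for (i, url) in enumerate(chunk)
        @async begin
            resp = get(url)       # the task yields while waiting on I/O
            statuses[i] = resp.status
        end
    end
    statuses
end

nw = nworkers()
chunks = [urls[i:nw:end] for i in 1:nw]   # one chunk per worker
results = pmap(fetch_chunk, chunks)       # processes in parallel, tasks within each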

So I will update my approach and come back with results or new questions. 



On Monday, August 24, 2015 at 2:13:23 AM UTC+3, Steven G. Johnson wrote:

 @parallel in Julia is for executing separate parallel processes (true 
 multi-tasking, with separate address spaces).  @async is for tasks, which 
 are green threads and represent cooperative multitasking (within the same 
 process and the same address space).

 I/O in Julia is asynchronous — while one task is blocked waiting on I/O, 
 another task will wake up and start running.  (This is based on the libuv 
 library, which is designed for high-performance asynchronous I/O.)

 The first question is whether you want to fetch URLs in separate OS 
 processes, or you want to use green threads within the same process.  It 
 sounds like you want the latter, in which case @async is the right thing.

 The second question is whether something about the Requests.jl package is 
 serializing things somehow; for that you might file an issue at Requests.jl.



[julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

2015-08-23 Thread Steven G. Johnson
@parallel in Julia is for executing separate parallel processes (true 
multi-tasking, with separate address spaces).  @async is for tasks, which 
are green threads and represent cooperative multitasking (within the same 
process and the same address space).

I/O in Julia is asynchronous — while one task is blocked waiting on I/O, 
another task will wake up and start running.  (This is based on the libuv 
library, which is designed for high-performance asynchronous I/O.)
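
For illustration, with plain sleep standing in for blocking I/O: three tasks
that each block for one second finish in about one second total, because each
yields to the scheduler while it waits.

@time @sync for i in 1:3
    @async sleep(1)    # each task yields here instead of blocking the process
end
# reports roughly 1 second elapsed, not 3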

The first question is whether you want to fetch URLs in separate OS 
processes, or you want to use green threads within the same process.  It 
sounds like you want the latter, in which case @async is the right thing.

The second question is whether something about the Requests.jl package is 
serializing things somehow; for that you might file an issue at Requests.jl.


[julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

2015-08-22 Thread Andrei Zh
BTW, I'm not sure I use @parallel correctly, so I tried to start tasks 
manually with @async too: 

@time @sync for url in urls
    @async begin
        resp = get(url)
        println("Status: $(resp.status)")
    end
end

But I didn't notice any difference in performance. 


On Sunday, August 23, 2015 at 5:52:27 AM UTC+3, Andrei Zh wrote:

 I'm writing a kind of web scanner that should retrieve and analyze about 
 100k URLs as fast as possible. Of course, it will take time anyway, but I'm 
 looking for a way to utilize my CPUs and network as much as possible. 

 My initial approach was to add all available processors, pack urls into 
 tasks and run these tasks in parallel: 

 
 using Requests
 urls = ...
 @time @sync @parallel for url in urls
     resp = get(url)
     println("Status: $(resp.status)")
 end

 My assumption was that 100k tasks would be created, each task would 
 execute a GET request and, since this is an I/O operation, free the current 
 thread for other tasks. From the logs, however, I see that each worker 
 executes tasks one by one, each time waiting for the GET request to finish. 

 So how do I start 100k requests in parallel? 

 (100k here is just an example; I can easily split them into chunks of 
 10k, so system limits and an overloaded CPU/network are not the 
 issue; the issue is their *underutilization*.) 

 Thanks