Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?
I submitted a new issue https://github.com/JuliaWeb/Requests.jl/issues/61 to Requests.jl describing the problem and my observations. In short, each separate request takes a reasonable time, but when I launch a lot of them in tasks, they become very slow.
[julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?
On Mon, Aug 24, 2015 at 9:17 AM, Andrei Zh wrote:

Jonathan, thanks for your support. So far I have noticed that DNS resolution adds a fairly large delay. For example, resolving IP addresses for 1000 URLs took 80 seconds in serial code and 26 seconds in multi-task code.

Serial execution:

    @time for url in urls
        begin
            Base.getaddrinfo(URI(url).host)
        end
    end

    elapsed time: 80.071810293 seconds (732400 bytes allocated)

Multitask execution:

    @time @sync for url in urls
        # sleep(0.01)
        @async begin
            # t = @elapsed resp = get(url)
            # println("Status: $(resp.status) ($(t) sec)")
            Base.getaddrinfo(URI(url).host)
        end
    end

    elapsed time: 26.241893516 seconds (4277968 bytes allocated)

So I'll try to pre-resolve IPs and test again.
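The "pre-resolve IPs" idea mentioned above could be sketched roughly as follows. This is only an illustration written for a current Julia 1.x, not the code used in the thread: `preresolve` and `fake_resolver` are hypothetical names, the function takes host names directly instead of parsing URLs, and the demo uses a stub resolver so it runs without network access.

```julia
using Sockets

# Hypothetical helper: resolve each distinct host once, with at most `ntasks`
# lookups in flight, and return a host => address dictionary.
function preresolve(hosts; resolver = Sockets.getaddrinfo, ntasks = 32)
    unique_hosts = unique(hosts)
    addrs = asyncmap(resolver, unique_hosts; ntasks = ntasks)
    return Dict(zip(unique_hosts, addrs))
end

# Stub resolver for the demo: sleeps like a slow lookup, then returns loopback.
calls = Ref(0)
function fake_resolver(host)
    calls[] += 1
    sleep(0.05)
    return ip"127.0.0.1"
end

cache = preresolve(["a.example", "b.example", "a.example"]; resolver = fake_resolver)
println(cache)
```

With the default `resolver`, the same call would perform real DNS lookups, so duplicate hosts in a long URL list are resolved only once.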
[julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?
On Monday, August 24, 2015 at 4:01:44 PM UTC+3, Jonathan Malmaud wrote:

As one of the maintainers of Requests.jl, I'm especially interested in its use for high-performance applications, so don't hesitate to file an issue if it gives you any performance problems.
Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?
On Monday, August 24, 2015 at 6:49:47 AM UTC-7, Jonathan Malmaud wrote:

Thanks for the report, Andrei. Would you mind filing this as an issue at https://github.com/JuliaWeb/Requests.jl?
Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?
On Monday, August 24, 2015 at 4:25:06 PM UTC+2, Seth wrote:

Name resolution delays are generally an issue with network latency. Trying to resolve 1000 uncached names will take a while on any system:

    seth@schroeder:~$ time host www.julialang.org
    www.julialang.org is an alias for julialang.github.io.
    julialang.github.io is an alias for github.map.fastly.net.
    github.map.fastly.net has address 23.235.47.133

    real    0m3.268s
    user    0m0.003s
    sys     0m0.020s

Unless your server is localhost, you're going to have delays.
Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?
On Monday, August 24, 2015 at 7:42:16 AM UTC-7, Uwe Fechner wrote:

Did you try the DNS servers from Google, e.g. 8.8.8.8? I have never seen a reply that took more than one second. (Well, in our university network.)
Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?
Uni networks are typically high bandwidth and low latency and are optimally peered. My home network, unfortunately, exhibits none of those traits :) In any case, even a 70-millisecond resolution time results in 70 seconds of delay for 1000 uncached resolutions performed one after another.
Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?
On Monday, August 24, 2015 at 8:21:51 PM UTC+3, Jameson wrote:

If you are doing a lot of parallel DNS queries, you may want to try increasing the number that can run simultaneously by setting the UV_THREADPOOL_SIZE environment variable to something larger before starting julia (default is 4, max is 128).
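As a small illustration of the UV_THREADPOOL_SIZE advice: the variable has to be in the environment of the Julia process before it starts, for example `UV_THREADPOOL_SIZE=32 julia script.jl` from a shell. The sketch below, written for a current Julia 1.x, starts a child Julia process with the variable set and has the child report what it sees. `child_threadpool_size` is a hypothetical helper name, and the value 32 is just the one tried later in this thread.

```julia
# Hypothetical helper: start a child Julia process with UV_THREADPOOL_SIZE set
# and return the value the child sees in its environment.
function child_threadpool_size(n::Integer)
    code = "print(get(ENV, \"UV_THREADPOOL_SIZE\", \"unset\"))"
    cmd = `$(Base.julia_cmd()) --startup-file=no -e $code`
    # withenv changes ENV only for the duration of the block;
    # the child process inherits the modified environment.
    return withenv("UV_THREADPOOL_SIZE" => string(n)) do
        read(cmd, String)
    end
end

println(child_threadpool_size(32))
```

Setting `ENV["UV_THREADPOOL_SIZE"]` inside an already running session is not equivalent, since the advice above is to set it before starting julia.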
Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?
On Tuesday, August 25, 2015 at 2:44:30 AM UTC+3, Andrei Zh wrote:

@Jameson: setting UV_THREADPOOL_SIZE to 32 seems to cut DNS resolution time roughly in half (from 26 to 12 seconds in my latest tests), so thank you. However, DNS does not seem to be the only root of the problem: I noticed that with a large number of URLs, the first ones get a result very quickly (300-800 ms), but then latency begins to grow until the timeout is exceeded (I've seen more than 100 seconds for some requests). I will try to set up a stable test and post it here as well as in the Requests.jl issues.
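The pattern described above (early requests fast, later ones increasingly slow) is what queueing behind a small shared resource looks like when latency is measured from the moment all tasks are launched. The sketch below only illustrates that effect with simulated work; it is not a diagnosis of Requests.jl, and `queued_latencies`, `slots`, and `service_time` are hypothetical names chosen for the example.

```julia
# Launch n tasks at once; only `slots` of them may do their (simulated) work
# at the same time. Latency is measured from the common launch time.
function queued_latencies(n; slots = 4, service_time = 0.02)
    sem = Base.Semaphore(slots)
    latencies = zeros(n)
    start = time()
    @sync for i in 1:n
        @async begin
            Base.acquire(sem)
            try
                sleep(service_time)   # stand-in for a lookup or request
            finally
                Base.release(sem)
            end
            latencies[i] = time() - start
        end
    end
    return latencies
end

lat = queued_latencies(40)
println("fastest: ", round(minimum(lat); digits = 3), " s, slowest: ",
        round(maximum(lat); digits = 3), " s")
```

Measuring each request with its own `@elapsed`, started only once the request actually begins, would help separate time spent waiting in a queue from time spent on the network.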
[julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?
More generally, is there anything like futures/callbacks or async IO in Julia? I couldn't find anything, but maybe I just don't know the right keywords.
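For what it's worth, a task created with `@async` can already be used like a future: it starts running in the background and `fetch` blocks until its value is ready. A minimal sketch for a current Julia 1.x, with `sleep` standing in for a slow network call and all variable names chosen only for illustration:

```julia
# A Task doubles as a future: create it now, fetch the value later.
future = @async begin
    sleep(0.1)            # stand-in for a slow network call
    "response body"
end

println("doing other work while the task waits")

# A callback can be written as another task that waits on the first one.
callback_result = Ref("")
cb = @async begin
    callback_result[] = uppercase(fetch(future))
end

wait(cb)
println(fetch(future), " -> ", callback_result[])
```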
[julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?
On Sunday, August 23, 2015 at 7:40:08 PM UTC-4, Andrei Zh wrote:

Hi Steven, thanks for your answer! It turns out I misunderstood @async a long time ago, assuming it also makes a remote call to other processes and thus introduces true multi-tasking. So now I need to rethink my approach before going further.

Just to clarify: my goal is to perform as many requests as possible at the same time, so I want to use both multiple processes (to start several requests on several cores in parallel) and tasks (to launch new requests while old ones are still waiting for IO to complete). So I will update my approach and come back with results or new questions.
[julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?
On Monday, August 24, 2015 at 2:13:23 AM UTC+3, Steven G. Johnson wrote:

@parallel in Julia is for executing separate parallel processes (true multi-tasking, with separate address spaces). @async is for tasks, which are green threads and represent cooperative multitasking (within the same process and the same address space). I/O in Julia is asynchronous: while one task is blocked waiting on I/O, another task will wake up and start running. (This is based on the libuv library, which is designed for high-performance asynchronous I/O.)

The first question is whether you want to fetch URLs in separate OS processes, or you want to use green threads within the same process. It sounds like you want the latter, in which case @async is the right thing. The second question is whether something about the Requests.jl package is serializing things somehow; for that you might file an issue at Requests.jl.
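To make the point about tasks concrete, here is a minimal sketch for a current Julia 1.x in which `sleep` stands in for a blocking I/O call (it yields to the scheduler the way a pending socket read would). `fake_request`, `run_serial`, and `run_tasks` are hypothetical names for this example only.

```julia
# Simulated blocking I/O: the task is suspended while "waiting".
fake_request(i) = (sleep(0.1); i)

# One request after another: total time is roughly n * 0.1 s.
run_serial(n) = [fake_request(i) for i in 1:n]

# One task per request within a single process: the waits overlap.
function run_tasks(n)
    results = Vector{Int}(undef, n)
    @sync for i in 1:n
        @async results[i] = fake_request(i)
    end
    return results
end

t_serial = @elapsed run_serial(20)
t_tasks = @elapsed run_tasks(20)
println("serial: ", round(t_serial; digits = 2), " s, tasks: ",
        round(t_tasks; digits = 2), " s")
```

No extra processes or cores are involved here; the speedup comes only from overlapping the waiting time.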
[julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?
BTW, I'm not sure I am using @parallel correctly, so I also tried to start the tasks manually with @async:

    @time @sync for url in urls
        @async begin
            resp = get(url)
            println("Status: $(resp.status)")
        end
    end

But I didn't notice any difference in performance.

On Sunday, August 23, 2015 at 5:52:27 AM UTC+3, Andrei Zh wrote:

I'm writing a kind of web scanner that should retrieve and analyze about 100k URLs as fast as possible. Of course, it will take time anyway, but I'm looking for a way to utilize my CPUs and network as much as possible.

My initial approach was to add all available processors, pack the URLs into tasks and run these tasks in parallel:

    using Requests

    urls = ...
    @time @sync @parallel for url in urls
        resp = get(url)
        println("Status: $(resp.status)")
    end

My assumption was that 100k tasks would be created, each task would execute a GET request and, since this is an IO operation, free the current thread for other tasks. From the logs, however, I see that each worker executes tasks one by one, every time waiting for the GET request to finish.

So how do I start 100k requests in parallel? (100k is just an example here; I can easily split them into chunks of 10k, for example, so system limits and overused CPU/network are not an issue. The issue is their *underutilization*.)

Thanks
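One way to keep many requests in flight without creating 100k tasks at once is a fixed pool of worker tasks pulling URLs from a shared queue. The sketch below is written for a current Julia 1.x and is only an illustration: `fetch_all` and `fake_get` are hypothetical names, `fake_get` simulates a request with `sleep`, and in real use the `fetch_one` argument would be whatever HTTP call is being made.

```julia
# Hypothetical helper: run fetch_one over all urls with at most `nworkers`
# requests in flight. Results keep the order of `urls`; if a request throws,
# the exception object is stored in place of the result.
function fetch_all(fetch_one, urls; nworkers = 100)
    jobs = Channel{Tuple{Int,String}}(length(urls))
    for (i, u) in enumerate(urls)
        put!(jobs, (i, u))
    end
    close(jobs)   # workers stop once the queue is drained

    results = Vector{Any}(undef, length(urls))
    @sync for _ in 1:nworkers
        @async for (i, u) in jobs
            results[i] = try
                fetch_one(u)
            catch err
                err
            end
        end
    end
    return results
end

# Stand-in for an HTTP GET that always "returns" status 200.
fake_get(url) = (sleep(0.05); 200)

urls = ["http://example.com/$i" for i in 1:200]
statuses = fetch_all(fake_get, urls; nworkers = 50)
println(count(==(200), statuses), " of ", length(urls), " requests succeeded")
```

The `nworkers` setting plays the role of the chunk size mentioned above: it bounds how many requests wait on the network at the same time.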