Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-30 Thread Antoine Pitrou
On 30/09/2020 at 02:58, Pierre Belzile wrote: > Hi, > > Some thoughts: > 1. For async IO, the system must have threads that quickly service the > callback. Otherwise the S3/GCS end will close the connection. A single > thread pool where all the threads are doing an expensive compute operation

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-29 Thread Pierre Belzile
Hi, Some thoughts: 1. For async IO, the system must have threads that quickly service the callback. Otherwise the S3/GCS end will close the connection. A single thread pool where all the threads are doing an expensive compute operation (like CSV decoding or regex matching) can starve the IO. 2.
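To make point 1 concrete, here is a minimal sketch of keeping IO callbacks on their own small pool so they are still serviced while every CPU worker is busy. SimplePool and IoServicedWhileCpuBusy are hypothetical names invented for this illustration; this is not Arrow's thread pool, and the blocked tasks only stand in for expensive work such as CSV decoding.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical minimal fixed-size pool (illustration only, not Arrow's API).
class SimplePool {
 public:
  explicit SimplePool(std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
      workers_.emplace_back([this] { Run(); });
    }
  }

  ~SimplePool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& t : workers_) t.join();
  }

  // Queue a callable and return a future for its result.
  template <typename F>
  auto Submit(F fn) -> std::future<decltype(fn())> {
    using R = decltype(fn());
    auto task = std::make_shared<std::packaged_task<R()>>(std::move(fn));
    std::future<R> result = task->get_future();
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push([task] { (*task)(); });
    }
    cv_.notify_one();
    return result;
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // stop requested and queue drained
        job = std::move(tasks_.front());
        tasks_.pop();
      }
      job();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> workers_;
  bool stop_ = false;
};

// Returns true if an IO-style callback completes while the CPU pool is
// fully occupied by tasks that will not finish until we release them.
bool IoServicedWhileCpuBusy() {
  SimplePool cpu_pool(2);
  SimplePool io_pool(1);

  std::promise<void> release;
  std::shared_future<void> gate = release.get_future().share();

  // Saturate the CPU pool; each task blocks until the gate opens.
  std::vector<std::future<void>> busy;
  for (int i = 0; i < 2; ++i) {
    busy.push_back(cpu_pool.Submit([gate] { gate.wait(); }));
  }

  // The IO callback runs on its own pool, so it is not starved.
  std::future<int> io_done = io_pool.Submit([] { return 42; });
  int value = io_done.get();

  release.set_value();
  for (auto& f : busy) f.get();
  return value == 42;
}
```

Had the callback been submitted to cpu_pool instead, it would have sat in the queue behind the two blocked tasks until the gate was released, which is the starvation described above.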

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-29 Thread Weston Pace
Antoine/Wes, thanks for the input. I will focus on the CSV reader and the minimal async needed to get I/O off the thread pool and support for a nested task group. This is just to focus on one small thing at a time. I'll avoid any scheduler work for now but maybe can look at that in the future.

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-28 Thread Antoine Pitrou
On 28/09/2020 at 11:38, Antoine Pitrou wrote: > > Hi Weston, > > On 25/09/2020 at 23:21, Weston Pace wrote: >> >> * The current thread pool implementation deadlocks when used in a >> "nested" case, an asynchronous solution can work around this > > If required it may be possible to hack

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-28 Thread Antoine Pitrou
Hi Weston, On 25/09/2020 at 23:21, Weston Pace wrote: > > * The current thread pool implementation deadlocks when used in a > "nested" case, an asynchronous solution can work around this If required it may be possible to hack around this. For example, AFAIR TBB has a simple heuristic to

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-27 Thread Wes McKinney
Hi Weston -- this is a really interesting analysis. 1. I have been under the assumption that the current libraries work poorly on high latency file systems, and your analysis provides the proof, so thank you. 2. This shows that we have a lot of work to do to retool many of our IO libraries

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-25 Thread Weston Pace
So this may be a return to the details. I think the larger discussion is a good one to have, but I don't know enough of the code base to comment further. I finished playing around with the CSV reader. The code for this experiment can be found here

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-24 Thread Wes McKinney
Not C++, but I found the discussion about the Rust tokio project's scheduler to be interesting / relevant https://tokio.rs/blog/2019-10-scheduler On Tue, Sep 22, 2020 at 4:54 PM Wes McKinney wrote: > > Thanks for the pointer to CAF. It reminds me a bit of libprocess which > is a part of

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-22 Thread Wes McKinney
Thanks for the pointer to CAF. It reminds me a bit of libprocess which is a part of Apache Mesos, which also provides the actor model https://github.com/apache/mesos/tree/master/3rdparty/libprocess We'll have to determine a solution that is compatible with our spectrum of compiler toolchain

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-22 Thread Matthias Vallentin
We are building a highly concurrent database for security data with Arrow as data plane (VAST), so I thought I'd share our view on this since we went over pretty much all of the above-mentioned questions. I'm not trying to say "you should do it this way" but

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-21 Thread Ben Kietzman
FWIW boost.coroutine and boost.asio provide composable coroutines, non-blocking IO, and configurable scheduling for CPU work out of the box. The boost libraries are not lightweight but they are robust and cross-platform, so I think asio is worth consideration. On Sat, Sep 19, 2020 at 8:22 PM Wes

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-19 Thread Wes McKinney
I took a look at https://github.com/kpamnany/partr and Julia's production iteration of that -- kpamnany/partr depends on libconcurrent's coroutine implementation which does not work on Windows. It appears that Julia is using libuv instead. If we're looking for a lighter-weight C coroutine

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-19 Thread Weston Pace
Ok, my skill with C++ got in the way of my ability to put something together. First, I did not realize that C++ futures were a little different than the definition I'm used to for futures. By default, C++ futures are not composable; you can't add continuations with `then`, `when_all` or
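As a small illustration of the composability gap, the sketch below bolts a continuation onto std::future by hand. Then and RunChain are hypothetical helpers written for this note, not part of the standard library or of Arrow. The helper parks one extra thread in get() until the input is ready, so it shows what chaining looks like rather than how a non-blocking implementation would do it.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <utility>

// Hypothetical helper: run `continuation` on the value of `input` once it
// is ready. This blocks a dedicated thread while waiting, which is exactly
// the cost a real continuation-based future is meant to avoid.
template <typename T, typename F>
auto Then(std::future<T> input, F continuation)
    -> std::future<decltype(continuation(std::declval<T>()))> {
  return std::async(
      std::launch::async,
      [](std::future<T> in, F fn) { return fn(in.get()); },
      std::move(input), std::move(continuation));
}

// Example chain: produce a number, double it, then format it as text.
inline std::string RunChain() {
  std::future<int> first = std::async(std::launch::async, [] { return 21; });
  auto doubled = Then(std::move(first), [](int v) { return v * 2; });
  auto text = Then(std::move(doubled), [](int v) { return std::to_string(v); });
  return text.get();
}
```

With a three-stage chain this occupies up to three threads at once, so the approach does not scale to the many small tasks a file scan produces; it is only meant to show the shape of the API that std::future lacks by default.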

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-16 Thread Weston Pace
If you want to specifically look at the problem of dataset scanning, file scanning, and nested parallelism then probably the lowest effort improvement would be to eliminate the whole idea of "scan threads". You currently have... for (size_t i = 0; i < readers.size(); ++i) {

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-16 Thread Wes McKinney
On Wed, Sep 16, 2020 at 10:31 AM Jorge Cardoso Leitão wrote: > > Hi, > > I am not sure I fully understand, so I will try to give an example to > check: we have a simple query that we want to write the result to some > place: > > SELECT t1.b * t2.b FROM t1 JOIN t2 ON t1.a = t2.a > > At the

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-16 Thread Wes McKinney
On Wed, Sep 16, 2020 at 10:49 AM Adam Hooper wrote: > > On Tue, Sep 15, 2020 at 1:00 PM Wes McKinney wrote: > > > We have additional problems in that some file-loading related tasks do > > a mixture of CPU work and IO work, and once a thread has been > > dispatched to execute one of these tasks,

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-16 Thread Adam Hooper
On Tue, Sep 15, 2020 at 1:00 PM Wes McKinney wrote: > We have additional problems in that some file-loading related tasks do > a mixture of CPU work and IO work, and once a thread has been > dispatched to execute one of these tasks, when IO takes place, a CPU > core may sit underutilized while

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-16 Thread Jorge Cardoso Leitão
Hi, I am not sure I fully understand, so I will try to give an example to check: we have a simple query that we want to write the result to some place: SELECT t1.b * t2.b FROM t1 JOIN t2 ON t1.a = t2.a At the physical plane, we need to 1. read each file in batches 2. join the batches 3.

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-16 Thread Wes McKinney
hi Jacob, The approach taken in Julia strikes me as being motivated by the same problems that we have in this project. It would be interesting if partr could be used as the basis of our nested parallelism runtime. How does Julia handle IO calls within spawned tasks? In other words, if we have a

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-16 Thread Jacob Quinn
My immediate thought reading the discussion points was Julia's task-based multithreading model that has been part of the language for over a year now. An announcement blogpost for Julia 1.3 laid out some of the details and high-level approach: https://julialang.org/blog/2019/07/multithreading/,

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-15 Thread Weston Pace
My C++ is pretty rusty but I'll see if I can come up with a concrete CSV example / experiment / proof of concept on Friday when I have a break from work. On Tue, Sep 15, 2020 at 3:47 PM Wes McKinney wrote: > > On Tue, Sep 15, 2020 at 7:54 PM Weston Pace wrote: > > > > Yes. Thank you. I am in

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-15 Thread Wes McKinney
On Tue, Sep 15, 2020 at 7:54 PM Weston Pace wrote: > > Yes. Thank you. I am in agreement with you and futures/callbacks are > one such "richer programming model for > hierarchical work scheduling". > > A scan task with a naive approach is: > > workers = partition_files_list(files_list) >

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-15 Thread Weston Pace
Yes. Thank you. I am in agreement with you and futures/callbacks are one such "richer programming model for hierarchical work scheduling". A scan task with a naive approach is: workers = partition_files_list(files_list) for worker in workers: start_thread(worker) for worker

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-15 Thread Wes McKinney
hi Weston, We've discussed some of these problems in the past -- I was enumerating some of these issues to highlight the problems that are resulting from an absence of a richer programming model for hierarchical work scheduling. Parallel tasks originating in each workload are submitted to a

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-15 Thread Weston Pace
It sounds like you are describing two problems. 1) Idleness - Tasks are holding threads in the thread pool while they wait for IO or some long-running non-CPU task to complete. These threads are often in a "wait" state or something similar. 2) Fairness - The ordering of tasks is causing short

[DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-15 Thread Wes McKinney
In light of ARROW-9924, I wanted to rekindle the discussion about our approach to multithreading (especially the _programming model_) in C++. We had some discussions about this about 6 months ago and there were more discussions as I recall in summer 2019. Realistically, we are going to be