David, Thanks so much for breaking everything down in such a digestible way! This has helped hugely in my understanding of Arrow Flight and some general concurrency concepts with Python.
Thank you sincerely! On Tue, Sep 7, 2021 at 4:39 PM David Li <[email protected]> wrote: > Yes, that's correct - Flight is not really handling concurrency itself and > leaves that to gRPC, so you could study gRPC itself instead. (That said, > there are likely more accessible frameworks available for study, including > some pure-Python implementations of gRPC.) This is in part because gRPC is > high level enough that there is no real need to manage concurrency > ourselves (though you can choose to do this in part by using the async > APIs), and in part because we would like to support things other than gRPC > in the future, so the code needs to be kept reasonably abstracted away from > those concerns. > > And yes, if you have shared mutable state in your Python process, other > threading primitives, like locks, are still useful, even if the threads are > not managed by Arrow/Python. Arrow, of course, does make sure to acquire > the GIL as required before calling into Python; this also means that > despite handling the bulk of a request in C++, there will naturally be a > limit to the scalability of a single-process Flight server written in > Python. > > I think I just saw a StackOverflow question from you? In which case I'll > mention again that the multiprocessing module is generally not safe to use > with Flight, in this case because gRPC has process-global state that needs > to be taken care of before/after fork(), and we do not handle this case > currently. (Note this is in contrast to the grpcio module/gRPC *for > Python*, which is a *separate* implementation of gRPC and does take care > of this!) > > I presume Flight is able to handle all of this on a single process because > it isn’t necessarily a CPU-intensive process with just pushing and pulling > data, which would primarily be network transfers—is that right? > > > A couple things here. Ultimately, for a Python service, you are still > bound by the GIL. But since most of the code is in C++ and isn't calling > into Python, the GIL is less limiting. For instance, if you return a > RecordBatchStream from a DoGet endpoint, that contains a reference to Arrow > data, which will be extracted and sent over the wire in C++ without further > interaction from Python, leaving the Python interpreter free to handle > another request while the C++ code takes care of the I/O. And yes, Flight > services are presumably I/O bound more than CPU bound, and Flight contains > some optimizations to help ensure that by reducing or eliminating copying > of data where possible. > > > -David > > On Tue, Sep 7, 2021, at 19:24, Michael Ark wrote: > > David, > > Thanks, that’s very helpful. I had suspected as much as I began to dig > into the code. I’m rather weak with concurrency and would like to see how > Arrow Flight is handling every request it gets. Are you suggesting that > even for Arrow Flight, it’s under-the-hood and the concurrency is actually > specific to gRPC—meaning that if I look through Apache Arrow’s source code, > the threading logic would be abstracted to the gRPC dependency? > > Presumably, this means that if I have stateful variables on my running > server in Python, I need to manage my own locks to ensure my data > structures are thread safe, though the actual management of threads would > be much farther upstream? > > On Tue, Sep 7, 2021 at 4:18 PM David Li <[email protected]> wrote: > > > Hey Michael, > > The key thing is that the threads are all managed by gRPC's C++ > implementation. On the server side, the C++ libraries underneath handle > incoming requests, encoding/decoding responses, etc. all concurrently > without calling into Python. Arrow calls into Python only for the actual > RPC endpoint logic. This is all multithreaded and within a single process. > (In fact, you probably should avoid fork() and things built on top of it, > like the multiprocessing module - it will not play well with the C++-level > libraries.) Threading is all managed by the C++ library and so there is not > any one place to look at, is there something specific you were looking for? > > Best, > David > > On Tue, Sep 7, 2021, at 18:45, Michael Ark wrote: > > I am relatively new to Python and Arrow Flight. I want to understand how > Arrow Flight works with multiple clients making multiple requests to a > single server. It seems like Arrow Flight handles concurrency. Is it > multithreaded, but single process? How are the threads managed? Where can I > find this logic? When I try to track the threads in the server with > logging, I get DummyThreads, so it’s not very helpful. > > #arrow-flight > > Thanks! Appreciate any help you can provide. > > > >
