That’s quite a rabbit hole I’ve uncovered. This is super helpful and thank you for the cookie crumbs, I will dig deeper into these items. I definitely don’t think I would have figured out the differences for the gRPC versions.
> Or you can start a multiprocessing pool before importing Flight I get that forking before the gRPC process is invoked MIGHT circumvent the problem you’ve described, but why does it have to be before **importing**? Would it not have to take place before the server begins, which would be after the flight server starts up and begins running/serving? Thank you for your time! On Thu, Sep 9, 2021 at 2:56 PM David Li <[email protected]> wrote: > Essentially, your process is liable to crash. This has nothing to do with > Python; it's because the gRPC C++ library initializes some process-global > state (including some locks), which get invalidated after fork(), and when > gRPC goes to use that state afterwards, it'll crash. Using pickle, Future, > or whatever doesn't affect this; as soon as you fork() after using gRPC, > your new process is irrevocably doomed. > > There's some vague references to this in the gRPC docs, along with things > that can go wrong: > https://github.com/grpc/grpc/blob/master/doc/fork_support.md Note that > the support referenced there only applies to the Python library "grpcio", > so as far as PyArrow/Flight is concerned, you absolutely should not > fork()/use multiprocessing. It's a little confusing: gRPC has a "Core" > library, upon which the C++ library (used by Flight, including in PyArrow) > and the Python library ("grpcio", not used by Arrow, including in PyArrow) > are built. "grpcio" has special support for fork(), but gRPC/C++ (and hence > Flight) does not. > > You *may* be able to get away with using multiprocessing if the child > processes never touch gRPC/Flight (that said, we don't test this at all). > Or you can start a multiprocessing pool before importing Flight, and/or use > the 'spawn' method[1] ('forkserver' may also work), so that the child > processes don't have any gRPC state. > > [1]: > https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods > > -David > > On Thu, Sep 9, 2021, at 17:01, Michael Ark wrote: > > David, > > To follow up on something you mentioned. > > > In fact, you probably should avoid fork() and things built on top of it, > like the multiprocessing module - it will not play well with the C++-level > libraries. > > Why would this be a problem, and how would a user of Flight know about > this? What would not playing well look like? Why would something like > ProcessPoolExecutor, which would work only on pickled objects and not part > of the shared state, cause problems? Does returning a Futures object for a > completely independent process to the state of the server really cause an > issue? > > What is the best way to conduct a CPU-heavy process on Flight? > > Thanks, > Michael > > On Tue, Sep 7, 2021 at 4:46 PM Michael Ark <[email protected]> > wrote: > > David, > > Thanks so much for breaking everything down in such a digestible way! This > has helped hugely in my understanding of Arrow Flight and some general > concurrency concepts with Python. > > Thank you sincerely! > > On Tue, Sep 7, 2021 at 4:39 PM David Li <[email protected]> wrote: > > > Yes, that's correct - Flight is not really handling concurrency itself and > leaves that to gRPC, so you could study gRPC itself instead. (That said, > there are likely more accessible frameworks available for study, including > some pure-Python implementations of gRPC.) This is in part because gRPC is > high level enough that there is no real need to manage concurrency > ourselves (though you can choose to do this in part by using the async > APIs), and in part because we would like to support things other than gRPC > in the future, so the code needs to be kept reasonably abstracted away from > those concerns. > > And yes, if you have shared mutable state in your Python process, other > threading primitives, like locks, are still useful, even if the threads are > not managed by Arrow/Python. Arrow, of course, does make sure to acquire > the GIL as required before calling into Python; this also means that > despite handling the bulk of a request in C++, there will naturally be a > limit to the scalability of a single-process Flight server written in > Python. > > I think I just saw a StackOverflow question from you? In which case I'll > mention again that the multiprocessing module is generally not safe to use > with Flight, in this case because gRPC has process-global state that needs > to be taken care of before/after fork(), and we do not handle this case > currently. (Note this is in contrast to the grpcio module/gRPC *for > Python*, which is a *separate* implementation of gRPC and does take care > of this!) > > I presume Flight is able to handle all of this on a single process because > it isn’t necessarily a CPU-intensive process with just pushing and pulling > data, which would primarily be network transfers—is that right? > > > A couple things here. Ultimately, for a Python service, you are still > bound by the GIL. But since most of the code is in C++ and isn't calling > into Python, the GIL is less limiting. For instance, if you return a > RecordBatchStream from a DoGet endpoint, that contains a reference to Arrow > data, which will be extracted and sent over the wire in C++ without further > interaction from Python, leaving the Python interpreter free to handle > another request while the C++ code takes care of the I/O. And yes, Flight > services are presumably I/O bound more than CPU bound, and Flight contains > some optimizations to help ensure that by reducing or eliminating copying > of data where possible. > > > -David > > On Tue, Sep 7, 2021, at 19:24, Michael Ark wrote: > > David, > > Thanks, that’s very helpful. I had suspected as much as I began to dig > into the code. I’m rather weak with concurrency and would like to see how > Arrow Flight is handling every request it gets. Are you suggesting that > even for Arrow Flight, it’s under-the-hood and the concurrency is actually > specific to gRPC—meaning that if I look through Apache Arrow’s source code, > the threading logic would be abstracted to the gRPC dependency? > > Presumably, this means that if I have stateful variables on my running > server in Python, I need to manage my own locks to ensure my data > structures are thread safe, though the actual management of threads would > be much farther upstream? > > On Tue, Sep 7, 2021 at 4:18 PM David Li <[email protected]> wrote: > > > Hey Michael, > > The key thing is that the threads are all managed by gRPC's C++ > implementation. On the server side, the C++ libraries underneath handle > incoming requests, encoding/decoding responses, etc. all concurrently > without calling into Python. Arrow calls into Python only for the actual > RPC endpoint logic. This is all multithreaded and within a single process. > (In fact, you probably should avoid fork() and things built on top of it, > like the multiprocessing module - it will not play well with the C++-level > libraries.) Threading is all managed by the C++ library and so there is not > any one place to look at, is there something specific you were looking for? > > Best, > David > > On Tue, Sep 7, 2021, at 18:45, Michael Ark wrote: > > I am relatively new to Python and Arrow Flight. I want to understand how > Arrow Flight works with multiple clients making multiple requests to a > single server. It seems like Arrow Flight handles concurrency. Is it > multithreaded, but single process? How are the threads managed? Where can I > find this logic? When I try to track the threads in the server with > logging, I get DummyThreads, so it’s not very helpful. > > #arrow-flight > > Thanks! Appreciate any help you can provide. > > > > >
