Essentially, your process is liable to crash. This has nothing to do with Python: the gRPC C++ library initializes some process-global state (including some locks), which is invalidated by fork(), and when gRPC next goes to use that state, it'll crash. Using pickle, Future, or anything else doesn't affect this; as soon as you fork() after using gRPC, your new process is irrevocably doomed.
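
For CPU-heavy work, one way to avoid fork()ing a process that has already used gRPC is to hand the work to a pool created with the 'spawn' start method, so that each worker begins as a fresh interpreter. A rough sketch follows; cpu_heavy and make_spawn_executor are made-up names for illustration, no Flight code is involved, and this should be treated as an assumption-laden outline rather than something tested against Flight:

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def cpu_heavy(n):
    """Hypothetical stand-in for CPU-bound work: sum of squares below n."""
    return sum(i * i for i in range(n))


def make_spawn_executor(max_workers=2):
    """Build a process pool whose workers are spawned rather than forked."""
    # 'spawn' launches each worker as a new interpreter instead of fork()ing
    # the parent, so workers do not inherit the parent's in-memory state
    # (which, in a Flight server, would include gRPC's process-global state).
    ctx = multiprocessing.get_context("spawn")
    return ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx)


if __name__ == "__main__":
    # The guard matters with 'spawn': workers re-import this module, and
    # anything outside the guard would run again in every worker.
    with make_spawn_executor() as executor:
        results = list(executor.map(cpu_heavy, [10, 100, 1000]))
    print(results)
```

Arguments and results still cross the process boundary via pickle, so the worker function has to be importable at module level, and it is safest if it never touches gRPC/Flight itself.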
There are some vague references to this in the gRPC docs, along with things that can go wrong: https://github.com/grpc/grpc/blob/master/doc/fork_support.md

Note that the support referenced there only applies to the Python library "grpcio", so as far as PyArrow/Flight is concerned, you absolutely should not fork()/use multiprocessing. It's a little confusing: gRPC has a "Core" library, upon which the C++ library (used by Flight, including in PyArrow) and the Python library ("grpcio", not used by Arrow, including in PyArrow) are built. "grpcio" has special support for fork(), but gRPC/C++ (and hence Flight) does not.

You *may* be able to get away with using multiprocessing if the child processes never touch gRPC/Flight (that said, we don't test this at all). Or you can start a multiprocessing pool before importing Flight, and/or use the 'spawn' method[1] ('forkserver' may also work), so that the child processes don't have any gRPC state.

[1]: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

-David

On Thu, Sep 9, 2021, at 17:01, Michael Ark wrote:
> David,
>
> To follow up on something you mentioned.
>
> > In fact, you probably should avoid fork() and things built on top of it,
> > like the multiprocessing module - it will not play well with the C++-level
> > libraries.
>
> Why would this be a problem, and how would a user of Flight know about this?
> What would not playing well look like? Why would something like
> ProcessPoolExecutor, which would work only on pickled objects and not part of
> the shared state, cause problems? Does returning a Futures object for a
> completely independent process to the state of the server really cause an
> issue?
>
> What is the best way to conduct a CPU-heavy process on Flight?
>
> Thanks,
> Michael
>
> On Tue, Sep 7, 2021 at 4:46 PM Michael Ark wrote:
>> David,
>>
>> Thanks so much for breaking everything down in such a digestible way!
>> This has helped hugely in my understanding of Arrow Flight and some general
>> concurrency concepts with Python.
>>
>> Thank you sincerely!
>>
>> On Tue, Sep 7, 2021 at 4:39 PM David Li wrote:
>>> Yes, that's correct - Flight is not really handling concurrency itself and
>>> leaves that to gRPC, so you could study gRPC itself instead. (That said,
>>> there are likely more accessible frameworks available for study, including
>>> some pure-Python implementations of gRPC.) This is in part because gRPC is
>>> high level enough that there is no real need to manage concurrency
>>> ourselves (though you can choose to do this in part by using the async
>>> APIs), and in part because we would like to support things other than gRPC
>>> in the future, so the code needs to be kept reasonably abstracted away from
>>> those concerns.
>>>
>>> And yes, if you have shared mutable state in your Python process, other
>>> threading primitives, like locks, are still useful, even if the threads are
>>> not managed by Arrow/Python. Arrow, of course, does make sure to acquire
>>> the GIL as required before calling into Python; this also means that
>>> despite handling the bulk of a request in C++, there will naturally be a
>>> limit to the scalability of a single-process Flight server written in
>>> Python.
>>>
>>> I think I just saw a StackOverflow question from you? In which case I'll
>>> mention again that the multiprocessing module is generally not safe to use
>>> with Flight, in this case because gRPC has process-global state that needs
>>> to be taken care of before/after fork(), and we do not handle this case
>>> currently. (Note this is in contrast to the grpcio module/gRPC *for
>>> Python*, which is a *separate* implementation of gRPC and does take care of
>>> this!)
>>>
>>>> I presume Flight is able to handle all of this on a single process because
>>>> it isn’t necessarily a CPU-intensive process with just pushing and pulling
>>>> data, which would primarily be network transfers—is that right?
>>>
>>> A couple things here. Ultimately, for a Python service, you are still bound
>>> by the GIL. But since most of the code is in C++ and isn't calling into
>>> Python, the GIL is less limiting. For instance, if you return a
>>> RecordBatchStream from a DoGet endpoint, that contains a reference to Arrow
>>> data, which will be extracted and sent over the wire in C++ without further
>>> interaction from Python, leaving the Python interpreter free to handle
>>> another request while the C++ code takes care of the I/O. And yes, Flight
>>> services are presumably I/O bound more than CPU bound, and Flight contains
>>> some optimizations to help ensure that by reducing or eliminating copying
>>> of data where possible.
>>>
>>>
>>> -David
>>>
>>> On Tue, Sep 7, 2021, at 19:24, Michael Ark wrote:
>>>> David,
>>>>
>>>> Thanks, that’s very helpful. I had suspected as much as I began to dig
>>>> into the code. I’m rather weak with concurrency and would like to see how
>>>> Arrow Flight is handling every request it gets. Are you suggesting that
>>>> even for Arrow Flight, it’s under-the-hood and the concurrency is actually
>>>> specific to gRPC—meaning that if I look through Apache Arrow’s source
>>>> code, the threading logic would be abstracted to the gRPC dependency?
>>>>
>>>> Presumably, this means that if I have stateful variables on my running
>>>> server in Python, I need to manage my own locks to ensure my data
>>>> structures are thread safe, though the actual management of threads would
>>>> be much farther upstream?
>>>>
>>>> On Tue, Sep 7, 2021 at 4:18 PM David Li wrote:
>>>>> Hey Michael,
>>>>>
>>>>> The key thing is that the threads are all managed by gRPC's C++
>>>>> implementation.
>>>>> On the server side, the C++ libraries underneath handle
>>>>> incoming requests, encoding/decoding responses, etc. all concurrently
>>>>> without calling into Python. Arrow calls into Python only for the actual
>>>>> RPC endpoint logic. This is all multithreaded and within a single
>>>>> process. (In fact, you probably should avoid fork() and things built on
>>>>> top of it, like the multiprocessing module - it will not play well with
>>>>> the C++-level libraries.) Threading is all managed by the C++ library and
>>>>> so there is not any one place to look at, is there something specific you
>>>>> were looking for?
>>>>>
>>>>> Best,
>>>>> David
>>>>>
>>>>> On Tue, Sep 7, 2021, at 18:45, Michael Ark wrote:
>>>>>> I am relatively new to Python and Arrow Flight. I want to understand how
>>>>>> Arrow Flight works with multiple clients making multiple requests to a
>>>>>> single server. It seems like Arrow Flight handles concurrency. Is it
>>>>>> multithreaded, but single process? How are the threads managed? Where
>>>>>> can I find this logic? When I try to track the threads in the server
>>>>>> with logging, I get DummyThreads, so it’s not very helpful.
>>>>>>
>>>>>> #arrow-flight
>>>>>>
>>>>>> Thanks! Appreciate any help you can provide.
>>>>>
>>>
