Ah, it just has to be before you do anything involving gRPC. The safest way to do that is to not import Flight (and hence load gRPC) in the first place. But right, I'm not sure if this is strictly required, or if it just has to be before you start the server. (Also, I believe 'spawn' and 'forkserver' should sidestep all this by starting an entirely new Python process anyways.)
-David On Thu, Sep 9, 2021, at 19:37, Michael Ark wrote: > That’s quite a rabbit hole I’ve uncovered. This is super helpful and thank > you for the cookie crumbs, I will dig deeper into these items. I definitely > don’t think I would have figured out the differences for the gRPC versions. > > > Or you can start a multiprocessing pool before importing Flight > > I get that forking before the gRPC process is invoked MIGHT circumvent the > problem you’ve described, but why does it have to be before **importing**? > Would it not have to take place before the server begins, which would be > after the flight server starts up and begins running/serving? > > Thank you for your time! > > On Thu, Sep 9, 2021 at 2:56 PM David Li <[email protected]> wrote: >> __ >> Essentially, your process is liable to crash. This has nothing to do with >> Python; it's because the gRPC C++ library initializes some process-global >> state (including some locks), which get invalidated after fork(), and when >> gRPC goes to use that state afterwards, it'll crash. Using pickle, Future, >> or whatever doesn't affect this; as soon as you fork() after using gRPC, >> your new process is irrevocably doomed. >> >> There's some vague references to this in the gRPC docs, along with things >> that can go wrong: >> https://github.com/grpc/grpc/blob/master/doc/fork_support.md Note that the >> support referenced there only applies to the Python library "grpcio", so as >> far as PyArrow/Flight is concerned, you absolutely should not fork()/use >> multiprocessing. It's a little confusing: gRPC has a "Core" library, upon >> which the C++ library (used by Flight, including in PyArrow) and the Python >> library ("grpcio", not used by Arrow, including in PyArrow) are built. >> "grpcio" has special support for fork(), but gRPC/C++ (and hence Flight) >> does not. >> >> You *may* be able to get away with using multiprocessing if the child >> processes never touch gRPC/Flight (that said, we don't test this at all). Or >> you can start a multiprocessing pool before importing Flight, and/or use the >> 'spawn' method[1] ('forkserver' may also work), so that the child processes >> don't have any gRPC state. >> >> [1]: >> https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods >> >> -David >> >> On Thu, Sep 9, 2021, at 17:01, Michael Ark wrote: >>> David, >>> >>> To follow up on something you mentioned. >>> >>> > In fact, you probably should avoid fork() and things built on top of it, >>> > like the multiprocessing module - it will not play well with the >>> > C++-level libraries. >>> >>> Why would this be a problem, and how would a user of Flight know about >>> this? What would not playing well look like? Why would something like >>> ProcessPoolExecutor, which would work only on pickled objects and not part >>> of the shared state, cause problems? Does returning a Futures object for a >>> completely independent process to the state of the server really cause an >>> issue? >>> >>> What is the best way to conduct a CPU-heavy process on Flight? >>> >>> Thanks, >>> Michael >>> >>> On Tue, Sep 7, 2021 at 4:46 PM Michael Ark <[email protected]> wrote: >>>> David, >>>> >>>> Thanks so much for breaking everything down in such a digestible way! This >>>> has helped hugely in my understanding of Arrow Flight and some general >>>> concurrency concepts with Python. >>>> >>>> Thank you sincerely! >>>> >>>> On Tue, Sep 7, 2021 at 4:39 PM David Li <[email protected]> wrote: >>>>> __ >>>>> Yes, that's correct - Flight is not really handling concurrency itself >>>>> and leaves that to gRPC, so you could study gRPC itself instead. (That >>>>> said, there are likely more accessible frameworks available for study, >>>>> including some pure-Python implementations of gRPC.) This is in part >>>>> because gRPC is high level enough that there is no real need to manage >>>>> concurrency ourselves (though you can choose to do this in part by using >>>>> the async APIs), and in part because we would like to support things >>>>> other than gRPC in the future, so the code needs to be kept reasonably >>>>> abstracted away from those concerns. >>>>> >>>>> And yes, if you have shared mutable state in your Python process, other >>>>> threading primitives, like locks, are still useful, even if the threads >>>>> are not managed by Arrow/Python. Arrow, of course, does make sure to >>>>> acquire the GIL as required before calling into Python; this also means >>>>> that despite handling the bulk of a request in C++, there will naturally >>>>> be a limit to the scalability of a single-process Flight server written >>>>> in Python. >>>>> >>>>> I think I just saw a StackOverflow question from you? In which case I'll >>>>> mention again that the multiprocessing module is generally not safe to >>>>> use with Flight, in this case because gRPC has process-global state that >>>>> needs to be taken care of before/after fork(), and we do not handle this >>>>> case currently. (Note this is in contrast to the grpcio module/gRPC *for >>>>> Python*, which is a *separate* implementation of gRPC and does take care >>>>> of this!) >>>>> >>>>>> I presume Flight is able to handle all of this on a single process >>>>>> because it isn’t necessarily a CPU-intensive process with just pushing >>>>>> and pulling data, which would primarily be network transfers—is that >>>>>> right? >>>>> >>>>> A couple things here. Ultimately, for a Python service, you are still >>>>> bound by the GIL. But since most of the code is in C++ and isn't calling >>>>> into Python, the GIL is less limiting. For instance, if you return a >>>>> RecordBatchStream from a DoGet endpoint, that contains a reference to >>>>> Arrow data, which will be extracted and sent over the wire in C++ without >>>>> further interaction from Python, leaving the Python interpreter free to >>>>> handle another request while the C++ code takes care of the I/O. And yes, >>>>> Flight services are presumably I/O bound more than CPU bound, and Flight >>>>> contains some optimizations to help ensure that by reducing or >>>>> eliminating copying of data where possible. >>>>> >>>>> >>>>> -David >>>>> >>>>> On Tue, Sep 7, 2021, at 19:24, Michael Ark wrote: >>>>>> David, >>>>>> >>>>>> Thanks, that’s very helpful. I had suspected as much as I began to dig >>>>>> into the code. I’m rather weak with concurrency and would like to see >>>>>> how Arrow Flight is handling every request it gets. Are you suggesting >>>>>> that even for Arrow Flight, it’s under-the-hood and the concurrency is >>>>>> actually specific to gRPC—meaning that if I look through Apache Arrow’s >>>>>> source code, the threading logic would be abstracted to the gRPC >>>>>> dependency? >>>>>> >>>>>> Presumably, this means that if I have stateful variables on my running >>>>>> server in Python, I need to manage my own locks to ensure my data >>>>>> structures are thread safe, though the actual management of threads >>>>>> would be much farther upstream? >>>>>> >>>>>> On Tue, Sep 7, 2021 at 4:18 PM David Li <[email protected]> wrote: >>>>>>> __ >>>>>>> Hey Michael, >>>>>>> >>>>>>> The key thing is that the threads are all managed by gRPC's C++ >>>>>>> implementation. On the server side, the C++ libraries underneath handle >>>>>>> incoming requests, encoding/decoding responses, etc. all concurrently >>>>>>> without calling into Python. Arrow calls into Python only for the >>>>>>> actual RPC endpoint logic. This is all multithreaded and within a >>>>>>> single process. (In fact, you probably should avoid fork() and things >>>>>>> built on top of it, like the multiprocessing module - it will not play >>>>>>> well with the C++-level libraries.) Threading is all managed by the C++ >>>>>>> library and so there is not any one place to look at, is there >>>>>>> something specific you were looking for? >>>>>>> >>>>>>> Best, >>>>>>> David >>>>>>> >>>>>>> On Tue, Sep 7, 2021, at 18:45, Michael Ark wrote: >>>>>>>> I am relatively new to Python and Arrow Flight. I want to understand >>>>>>>> how Arrow Flight works with multiple clients making multiple requests >>>>>>>> to a single server. It seems like Arrow Flight handles concurrency. Is >>>>>>>> it multithreaded, but single process? How are the threads managed? >>>>>>>> Where can I find this logic? When I try to track the threads in the >>>>>>>> server with logging, I get DummyThreads, so it’s not very helpful. >>>>>>>> >>>>>>>> #arrow-flight >>>>>>>> >>>>>>>> Thanks! Appreciate any help you can provide. >>>>>>> >>>>> >>
