Re: [Python] [Flight] How does concurrency in Arrow Flight work? Where in the source code can I find it?

David Li Thu, 09 Sep 2021 14:56:08 -0700

Essentially, your process is liable to crash. This has nothing to do with 
Python; it's because the gRPC C++ library initializes some process-global state 
(including some locks), which get invalidated after fork(), and when gRPC goes 
to use that state afterwards, it'll crash. Using pickle, Future, or whatever 
doesn't affect this; as soon as you fork() after using gRPC, your new process 
is irrevocably doomed.


There's some vague references to this in the gRPC docs, along with things that 
can go wrong: https://github.com/grpc/grpc/blob/master/doc/fork_support.md Note 
that the support referenced there only applies to the Python library "grpcio", 
so as far as PyArrow/Flight is concerned, you absolutely should not fork()/use 
multiprocessing. It's a little confusing: gRPC has a "Core" library, upon which 
the C++ library (used by Flight, including in PyArrow) and the Python library 
("grpcio", not used by Arrow, including in PyArrow) are built. "grpcio" has 
special support for fork(), but gRPC/C++ (and hence Flight) does not.

You *may* be able to get away with using multiprocessing if the child processes 
never touch gRPC/Flight (that said, we don't test this at all). Or you can 
start a multiprocessing pool before importing Flight, and/or use the 'spawn' 
method[1] ('forkserver' may also work), so that the child processes don't have 
any gRPC state.

[1]: 
https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

-David

On Thu, Sep 9, 2021, at 17:01, Michael Ark wrote:
> David,
> 
> To follow up on something you mentioned.
> 
> > In fact, you probably should avoid fork() and things built on top of it, 
> > like the multiprocessing module - it will not play well with the C++-level 
> > libraries.
> 
> Why would this be a problem, and how would a user of Flight know about this? 
> What would not playing well look like? Why would something like 
> ProcessPoolExecutor, which would work only on pickled objects and not part of 
> the shared state, cause problems? Does returning a Futures object for a 
> completely independent process to the state of the server really cause an 
> issue? 
> 
> What is the best way to conduct a CPU-heavy process on Flight?
> 
> Thanks,
> Michael
> 
> On Tue, Sep 7, 2021 at 4:46 PM Michael Ark <[email protected]> wrote:
>> David,
>> 
>> Thanks so much for breaking everything down in such a digestible way! This 
>> has helped hugely in my understanding of Arrow Flight and some general 
>> concurrency concepts with Python.
>> 
>> Thank you sincerely!
>> 
>> On Tue, Sep 7, 2021 at 4:39 PM David Li <[email protected]> wrote:
>>> __
>>> Yes, that's correct - Flight is not really handling concurrency itself and 
>>> leaves that to gRPC, so you could study gRPC itself instead. (That said, 
>>> there are likely more accessible frameworks available for study, including 
>>> some pure-Python implementations of gRPC.) This is in part because gRPC is 
>>> high level enough that there is no real need to manage concurrency 
>>> ourselves (though you can choose to do this in part by using the async 
>>> APIs), and in part because we would like to support things other than gRPC 
>>> in the future, so the code needs to be kept reasonably abstracted away from 
>>> those concerns.
>>> 
>>> And yes, if you have shared mutable state in your Python process, other 
>>> threading primitives, like locks, are still useful, even if the threads are 
>>> not managed by Arrow/Python. Arrow, of course, does make sure to acquire 
>>> the GIL as required before calling into Python; this also means that 
>>> despite handling the bulk of a request in C++, there will naturally be a 
>>> limit to the scalability of a single-process Flight server written in 
>>> Python.
>>> 
>>> I think I just saw a StackOverflow question from you? In which case I'll 
>>> mention again that the multiprocessing module is generally not safe to use 
>>> with Flight, in this case because gRPC has process-global state that needs 
>>> to be taken care of before/after fork(), and we do not handle this case 
>>> currently. (Note this is in contrast to the grpcio module/gRPC *for 
>>> Python*, which is a *separate* implementation of gRPC and does take care of 
>>> this!)
>>> 
>>>> I presume Flight is able to handle all of this on a single process because 
>>>> it isn’t necessarily a CPU-intensive process with just pushing and pulling 
>>>> data, which would primarily be network transfers—is that right?
>>> 
>>> A couple things here. Ultimately, for a Python service, you are still bound 
>>> by the GIL. But since most of the code is in C++ and isn't calling into 
>>> Python, the GIL is less limiting. For instance, if you return a 
>>> RecordBatchStream from a DoGet endpoint, that contains a reference to Arrow 
>>> data, which will be extracted and sent over the wire in C++ without further 
>>> interaction from Python, leaving the Python interpreter free to handle 
>>> another request while the C++ code takes care of the I/O. And yes, Flight 
>>> services are presumably I/O bound more than CPU bound, and Flight contains 
>>> some optimizations to help ensure that by reducing or eliminating copying 
>>> of data where possible.
>>> 
>>> 
>>> -David
>>> 
>>> On Tue, Sep 7, 2021, at 19:24, Michael Ark wrote:
>>>> David,
>>>> 
>>>> Thanks, that’s very helpful. I had suspected as much as I began to dig 
>>>> into the code. I’m rather weak with concurrency and would like to see how 
>>>> Arrow Flight is handling every request it gets. Are you suggesting that 
>>>> even for Arrow Flight, it’s under-the-hood and the concurrency is actually 
>>>> specific to gRPC—meaning that if I look through Apache Arrow’s source 
>>>> code, the threading logic would be abstracted to the gRPC dependency?
>>>> 
>>>> Presumably, this means that if I have stateful variables on my running 
>>>> server in Python, I need to manage my own locks to ensure my data 
>>>> structures are thread safe, though the actual management of threads would 
>>>> be much farther upstream?
>>>> 
>>>> On Tue, Sep 7, 2021 at 4:18 PM David Li <[email protected]> wrote:
>>>>> __
>>>>> Hey Michael,
>>>>> 
>>>>> The key thing is that the threads are all managed by gRPC's C++ 
>>>>> implementation. On the server side, the C++ libraries underneath handle 
>>>>> incoming requests, encoding/decoding responses, etc. all concurrently 
>>>>> without calling into Python. Arrow calls into Python only for the actual 
>>>>> RPC endpoint logic. This is all multithreaded and within a single 
>>>>> process. (In fact, you probably should avoid fork() and things built on 
>>>>> top of it, like the multiprocessing module - it will not play well with 
>>>>> the C++-level libraries.) Threading is all managed by the C++ library and 
>>>>> so there is not any one place to look at, is there something specific you 
>>>>> were looking for?
>>>>> 
>>>>> Best,
>>>>> David
>>>>> 
>>>>> On Tue, Sep 7, 2021, at 18:45, Michael Ark wrote:
>>>>>> I am relatively new to Python and Arrow Flight. I want to understand how 
>>>>>> Arrow Flight works with multiple clients making multiple requests to a 
>>>>>> single server. It seems like Arrow Flight handles concurrency. Is it 
>>>>>> multithreaded, but single process? How are the threads managed? Where 
>>>>>> can I find this logic? When I try to track the threads in the server 
>>>>>> with logging, I get DummyThreads, so it’s not very helpful.
>>>>>> 
>>>>>> #arrow-flight
>>>>>> 
>>>>>> Thanks! Appreciate any help you can provide.
>>>>> 
>>>

Re: [Python] [Flight] How does concurrency in Arrow Flight work? Where in the source code can I find it?

Reply via email to