Re: [Python] [Flight] How does concurrency in Arrow Flight work? Where in the source code can I find it?

Michael Ark Thu, 09 Sep 2021 16:38:01 -0700

That’s quite a rabbit hole I’ve uncovered. This is super helpful and thank
you for the cookie crumbs, I will dig deeper into these items. I definitely
don’t think I would have figured out the differences for the gRPC versions.


> Or you can start a multiprocessing pool before importing Flight

I get that forking before the gRPC process is invoked MIGHT circumvent the
problem you’ve described, but why does it have to be before **importing**?
Would it not have to take place before the server begins, which would be
after the flight server starts up and begins running/serving?

Thank you for your time!

On Thu, Sep 9, 2021 at 2:56 PM David Li <[email protected]> wrote:

> Essentially, your process is liable to crash. This has nothing to do with
> Python; it's because the gRPC C++ library initializes some process-global
> state (including some locks), which get invalidated after fork(), and when
> gRPC goes to use that state afterwards, it'll crash. Using pickle, Future,
> or whatever doesn't affect this; as soon as you fork() after using gRPC,
> your new process is irrevocably doomed.
>
> There's some vague references to this in the gRPC docs, along with things
> that can go wrong:
> https://github.com/grpc/grpc/blob/master/doc/fork_support.md Note that
> the support referenced there only applies to the Python library "grpcio",
> so as far as PyArrow/Flight is concerned, you absolutely should not
> fork()/use multiprocessing. It's a little confusing: gRPC has a "Core"
> library, upon which the C++ library (used by Flight, including in PyArrow)
> and the Python library ("grpcio", not used by Arrow, including in PyArrow)
> are built. "grpcio" has special support for fork(), but gRPC/C++ (and hence
> Flight) does not.
>
> You *may* be able to get away with using multiprocessing if the child
> processes never touch gRPC/Flight (that said, we don't test this at all).
> Or you can start a multiprocessing pool before importing Flight, and/or use
> the 'spawn' method[1] ('forkserver' may also work), so that the child
> processes don't have any gRPC state.
>
> [1]:
> https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
>
> -David
>
> On Thu, Sep 9, 2021, at 17:01, Michael Ark wrote:
>
> David,
>
> To follow up on something you mentioned.
>
> > In fact, you probably should avoid fork() and things built on top of it,
> like the multiprocessing module - it will not play well with the C++-level
> libraries.
>
> Why would this be a problem, and how would a user of Flight know about
> this? What would not playing well look like? Why would something like
> ProcessPoolExecutor, which would work only on pickled objects and not part
> of the shared state, cause problems? Does returning a Futures object for a
> completely independent process to the state of the server really cause an
> issue?
>
> What is the best way to conduct a CPU-heavy process on Flight?
>
> Thanks,
> Michael
>
> On Tue, Sep 7, 2021 at 4:46 PM Michael Ark <[email protected]>
> wrote:
>
> David,
>
> Thanks so much for breaking everything down in such a digestible way! This
> has helped hugely in my understanding of Arrow Flight and some general
> concurrency concepts with Python.
>
> Thank you sincerely!
>
> On Tue, Sep 7, 2021 at 4:39 PM David Li <[email protected]> wrote:
>
>
> Yes, that's correct - Flight is not really handling concurrency itself and
> leaves that to gRPC, so you could study gRPC itself instead. (That said,
> there are likely more accessible frameworks available for study, including
> some pure-Python implementations of gRPC.) This is in part because gRPC is
> high level enough that there is no real need to manage concurrency
> ourselves (though you can choose to do this in part by using the async
> APIs), and in part because we would like to support things other than gRPC
> in the future, so the code needs to be kept reasonably abstracted away from
> those concerns.
>
> And yes, if you have shared mutable state in your Python process, other
> threading primitives, like locks, are still useful, even if the threads are
> not managed by Arrow/Python. Arrow, of course, does make sure to acquire
> the GIL as required before calling into Python; this also means that
> despite handling the bulk of a request in C++, there will naturally be a
> limit to the scalability of a single-process Flight server written in
> Python.
>
> I think I just saw a StackOverflow question from you? In which case I'll
> mention again that the multiprocessing module is generally not safe to use
> with Flight, in this case because gRPC has process-global state that needs
> to be taken care of before/after fork(), and we do not handle this case
> currently. (Note this is in contrast to the grpcio module/gRPC *for
> Python*, which is a *separate* implementation of gRPC and does take care
> of this!)
>
> I presume Flight is able to handle all of this on a single process because
> it isn’t necessarily a CPU-intensive process with just pushing and pulling
> data, which would primarily be network transfers—is that right?
>
>
> A couple things here. Ultimately, for a Python service, you are still
> bound by the GIL. But since most of the code is in C++ and isn't calling
> into Python, the GIL is less limiting. For instance, if you return a
> RecordBatchStream from a DoGet endpoint, that contains a reference to Arrow
> data, which will be extracted and sent over the wire in C++ without further
> interaction from Python, leaving the Python interpreter free to handle
> another request while the C++ code takes care of the I/O. And yes, Flight
> services are presumably I/O bound more than CPU bound, and Flight contains
> some optimizations to help ensure that by reducing or eliminating copying
> of data where possible.
>
>
> -David
>
> On Tue, Sep 7, 2021, at 19:24, Michael Ark wrote:
>
> David,
>
> Thanks, that’s very helpful. I had suspected as much as I began to dig
> into the code. I’m rather weak with concurrency and would like to see how
> Arrow Flight is handling every request it gets. Are you suggesting that
> even for Arrow Flight, it’s under-the-hood and the concurrency is actually
> specific to gRPC—meaning that if I look through Apache Arrow’s source code,
> the threading logic would be abstracted to the gRPC dependency?
>
> Presumably, this means that if I have stateful variables on my running
> server in Python, I need to manage my own locks to ensure my data
> structures are thread safe, though the actual management of threads would
> be much farther upstream?
>
> On Tue, Sep 7, 2021 at 4:18 PM David Li <[email protected]> wrote:
>
>
> Hey Michael,
>
> The key thing is that the threads are all managed by gRPC's C++
> implementation. On the server side, the C++ libraries underneath handle
> incoming requests, encoding/decoding responses, etc. all concurrently
> without calling into Python. Arrow calls into Python only for the actual
> RPC endpoint logic. This is all multithreaded and within a single process.
> (In fact, you probably should avoid fork() and things built on top of it,
> like the multiprocessing module - it will not play well with the C++-level
> libraries.) Threading is all managed by the C++ library and so there is not
> any one place to look at, is there something specific you were looking for?
>
> Best,
> David
>
> On Tue, Sep 7, 2021, at 18:45, Michael Ark wrote:
>
> I am relatively new to Python and Arrow Flight. I want to understand how
> Arrow Flight works with multiple clients making multiple requests to a
> single server. It seems like Arrow Flight handles concurrency. Is it
> multithreaded, but single process? How are the threads managed? Where can I
> find this logic? When I try to track the threads in the server with
> logging, I get DummyThreads, so it’s not very helpful.
>
> #arrow-flight
>
> Thanks! Appreciate any help you can provide.
>
>
>
>
>

Re: [Python] [Flight] How does concurrency in Arrow Flight work? Where in the source code can I find it?

Reply via email to