David,

Thanks so much for breaking everything down in such a digestible way! This
has helped hugely in my understanding of Arrow Flight and some general
concurrency concepts with Python.

Thank you sincerely!

On Tue, Sep 7, 2021 at 4:39 PM David Li <[email protected]> wrote:

> Yes, that's correct - Flight is not really handling concurrency itself and
> leaves that to gRPC, so you could study gRPC itself instead. (That said,
> there are likely more accessible frameworks available for study, including
> some pure-Python implementations of gRPC.) This is in part because gRPC is
> high level enough that there is no real need to manage concurrency
> ourselves (though you can choose to do this in part by using the async
> APIs), and in part because we would like to support things other than gRPC
> in the future, so the code needs to be kept reasonably abstracted away from
> those concerns.
>
> And yes, if you have shared mutable state in your Python process, other
> threading primitives, like locks, are still useful, even if the threads are
> not managed by Arrow/Python. Arrow, of course, does make sure to acquire
> the GIL as required before calling into Python; this also means that
> despite handling the bulk of a request in C++, there will naturally be a
> limit to the scalability of a single-process Flight server written in
> Python.
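
A minimal sketch of the locking point above, using only the standard library. The RequestCounter class and the "do_get" key are hypothetical names for illustration, and the thread pool merely stands in for the threads that would call into the handlers concurrently; it is not how Flight or gRPC schedules work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class RequestCounter:
    """Hypothetical shared state that several handler threads update."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def record(self, key):
        # The read-modify-write below is not atomic, so guard it with the lock.
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1

    def snapshot(self):
        # Return a copy so callers never see the dict while it is being changed.
        with self._lock:
            return dict(self._counts)


counter = RequestCounter()

# Stand-in for concurrent requests: many threads calling the same method.
with ThreadPoolExecutor(max_workers=8) as pool:
    for _ in range(1000):
        pool.submit(counter.record, "do_get")

result = counter.snapshot()
print(result)
```

The same pattern would apply to any dict, list, or cache that endpoint methods share, regardless of which library created the calling threads.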
>
> I think I just saw a StackOverflow question from you? In which case I'll
> mention again that the multiprocessing module is generally not safe to use
> with Flight, in this case because gRPC has process-global state that needs
> to be taken care of before/after fork(), and we do not handle this case
> currently. (Note this is in contrast to the grpcio module/gRPC *for
> Python*, which is a *separate* implementation of gRPC and does take care
> of this!)
>
> > I presume Flight is able to handle all of this on a single process because
> > it isn’t necessarily a CPU-intensive process with just pushing and pulling
> > data, which would primarily be network transfers—is that right?
>
>
> A couple of things here. Ultimately, for a Python service, you are still
> bound by the GIL. But since most of the code is in C++ and isn't calling
> into Python, the GIL is less limiting. For instance, if you return a
> RecordBatchStream from a DoGet endpoint, that contains a reference to Arrow
> data, which will be extracted and sent over the wire in C++ without further
> interaction from Python, leaving the Python interpreter free to handle
> another request while the C++ code takes care of the I/O. And yes, Flight
> services are presumably I/O bound more than CPU bound, and Flight contains
> some optimizations to help ensure that by reducing or eliminating copying
> of data where possible.
>
>
> -David
>
> On Tue, Sep 7, 2021, at 19:24, Michael Ark wrote:
>
> David,
>
> Thanks, that’s very helpful. I had suspected as much as I began to dig
> into the code. I’m rather weak with concurrency and would like to see how
> Arrow Flight is handling every request it gets. Are you suggesting that
> even for Arrow Flight, it’s under-the-hood and the concurrency is actually
> specific to gRPC—meaning that if I look through Apache Arrow’s source code,
> the threading logic would be abstracted to the gRPC dependency?
>
> Presumably, this means that if I have stateful variables on my running
> server in Python, I need to manage my own locks to ensure my data
> structures are thread safe, though the actual management of threads would
> be much farther upstream?
>
> On Tue, Sep 7, 2021 at 4:18 PM David Li <[email protected]> wrote:
>
>
> Hey Michael,
>
> The key thing is that the threads are all managed by gRPC's C++
> implementation. On the server side, the C++ libraries underneath handle
> incoming requests, encoding/decoding responses, etc. all concurrently
> without calling into Python. Arrow calls into Python only for the actual
> RPC endpoint logic. This is all multithreaded and within a single process.
> (In fact, you probably should avoid fork() and things built on top of it,
> like the multiprocessing module - it will not play well with the C++-level
> libraries.) Threading is all managed by the C++ library, so there is no
> single place to look. Is there something specific you were looking for?
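
On the DummyThreads from the original question: Python's threading module hands back a dummy thread object from threading.current_thread() whenever the calling thread was not started through the threading module. The sketch below reproduces that with the low-level _thread module standing in for a thread created outside of threading. It is only an illustration, not how gRPC creates its threads, and threading._DummyThread is a private CPython class used here purely for the check.

```python
import _thread
import threading

seen = {}
done = threading.Event()


def handler():
    # This thread was not created via threading.Thread, so the threading
    # module has no record of it and makes a placeholder object on demand.
    current = threading.current_thread()
    seen["name"] = current.name
    seen["is_dummy"] = isinstance(current, threading._DummyThread)
    done.set()


# Start a thread behind the threading module's back (stand-in for C++ threads).
_thread.start_new_thread(handler, ())
finished = done.wait(timeout=5)

print(seen)
```

Logging threading.get_ident() or threading.get_native_id() alongside the name may be more useful for telling such threads apart.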
>
> Best,
> David
>
> On Tue, Sep 7, 2021, at 18:45, Michael Ark wrote:
>
> I am relatively new to Python and Arrow Flight. I want to understand how
> Arrow Flight works with multiple clients making multiple requests to a
> single server. It seems like Arrow Flight handles concurrency. Is it
> multithreaded, but single process? How are the threads managed? Where can I
> find this logic? When I try to track the threads in the server with
> logging, I get DummyThreads, so it’s not very helpful.
>
> #arrow-flight
>
> Thanks! Appreciate any help you can provide.
>
>
>
>
