Hi Lubo,

Yes, opening a file (whether locally, on S3, or elsewhere) does have to do 
some work up front to read the file footer, load the schema, etc., before any 
batches can be read (see ARROW-15340 for an optimization that would cut down 
on the number of I/O operations here). The overhead should generally be minor 
(and as always, profile/benchmark your particular use case), but it's best not 
to repeatedly close and re-open the reader if you can help it.
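
For illustration, a rough sketch of what I mean (the path and the cache dict 
are placeholders, not anything your server has to look like):

    import pyarrow as pa
    import pyarrow.ipc as ipc

    # Keep one reader per dataset for the lifetime of the server, instead of
    # re-opening the file (and re-reading the footer/schema) on every request.
    _readers = {}

    def get_reader(path):
        reader = _readers.get(path)
        if reader is None:
            source = pa.memory_map(path, "r")  # memory-map for cheap random access
            reader = ipc.open_file(source)     # reads the footer + schema once
            _readers[path] = reader
        return reader

    reader = get_reader("/tmp/dataset.arrow")  # hypothetical path
    batch = reader.get_batch(0)                # random access to a single batch
    print(reader.num_record_batches, batch.num_rows)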

For streaming data from a record batch reader in DoGet, use RecordBatchStream: 
https://arrow.apache.org/docs/python/generated/pyarrow.flight.RecordBatchStream.html#pyarrow.flight.RecordBatchStream
As long as the record batch reader itself is not implemented in Python, this 
will bypass the GIL and will not call back into Python for each batch. (It 
will extract the underlying C++ RecordBatchReader object from the Python 
reader and use that directly.)
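
Roughly like this (the ticket handling and the stream-format path below are 
made up for the example; RecordBatchStream just needs a RecordBatchReader or 
a Table):

    import pyarrow as pa
    import pyarrow.ipc as ipc
    import pyarrow.flight as flight

    class MyFlightServer(flight.FlightServerBase):
        def do_get(self, context, ticket):
            # open_stream returns a reader backed by the C++ RecordBatchReader,
            # so RecordBatchStream can pull batches without re-entering Python.
            source = pa.memory_map("/tmp/dataset.arrows", "r")  # hypothetical path
            reader = ipc.open_stream(source)
            return flight.RecordBatchStream(reader)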

Ah, though: is the issue that RecordBatchFileReader is a distinct interface 
from RecordBatchReader? We should indeed fix that…
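
In the meantime, one workaround for the file format (sketched below with a 
made-up path; it assumes the file is memory-mapped, so read_all() mostly just 
references the mapped buffers rather than copying them) is to read it into a 
Table and hand that to RecordBatchStream, which accepts a Table as well:

    import pyarrow as pa
    import pyarrow.ipc as ipc
    import pyarrow.flight as flight

    def file_to_stream(path):
        source = pa.memory_map(path, "r")
        reader = ipc.open_file(source)  # RecordBatchFileReader
        table = reader.read_all()       # a Table over the memory-mapped buffers
        # RecordBatchStream also accepts a Table, so per-batch work stays in C++.
        return flight.RecordBatchStream(table)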

-David

On Thu, Mar 17, 2022, at 17:44, Lubomir Slivka wrote:
> Hi All,
> 
> I'm building a server with Flight RPC to store data and then allow 
> GetFlightInfo->DoGet of either the entire data or just particular record 
> batches. The server serializes data into the File (random access) format to 
> allow efficient retrieval of particular record batches.
> 
> I have a couple of questions in the area of reading the data using 
> RecordBatchFileReader.
> 
> ---
> 
> Skimming through the underlying C++ implementation, I see that the 
> RecordBatchFileReader caches 'some stuff' internally. 
> 
> >> This makes me wonder: what is the recommendation for the lifecycle of 
> >> RecordBatchFileReader instances? Is it a good idea to keep a single 
> >> instance of RecordBatchFileReader open as long as possible?
> 
> *My guess is yes; especially when working with files on S3, this could cut 
> down on some network calls.*
> 
> ---
> 
> Now the other thing is... when I'm trying to send out all data as a 
> response to DoGet, it seems that the only way to do so is to create a 
> GeneratorStream and pass a generator that yields batches 
> 0..num_record_batches from the RecordBatchFileReader.
> 
> >> Is the GeneratorStream the only way to stream out all data that is 
> >> serialized in IPC file format? Perhaps I missed something?
> 
> If the GeneratorStream is the right answer, my concern here is that, for 
> every batch, the C++ code will be calling 'up' to Python, grabbing the GIL 
> on the way, getting the batch, and then going back to C++.
> 
> How about having a RecordBatchReader that would do a one-time read of 
> batches 0..num_record_batches from the RecordBatchFileReader? I have seen a 
> couple of adapters into RecordBatchReader elsewhere in the codebase, so 
> perhaps it's not totally off? If the adapter were 'end-to-end' (a new 
> RecordBatchReader implementation in C++, wrapped using Cython), then passing 
> such a thing to `pyarrow.flight.RecordBatchStream` would mean all the 
> sending could be done in C++. Plus, I think it makes some sense to have a 
> common interface for reading all batches regardless of the format (stream 
> vs. random access). What do you think?
> 
> *Note: I have experimented with building this, but I think I hit a wall. 
> The work in C++ seems straightforward (I found a few other places where 
> adapters to RecordBatchReader are done, so that was good inspiration). 
> However, I ran into problems in Cython because RecordBatchFileReader is only 
> defined in ipc.pxi and does not have a `cdef class` block anywhere in the 
> *.pxd files. And so - as far as my Cython experience goes - the extension 
> cannot get hold of the RecordBatchFileReader's underlying C++ instance. I'm 
> very new to Cython and Python/C++ extensions, so perhaps there is some other 
> way? Of course, I could still build MyOwnRecordBatchFileReader in Cython, 
> but I would rather play with PyArrow types.*
> 
> Thank you and best regards,
> Lubo Slivka
> 
