Hi all,

I'm building a server with Flight RPC that stores data and then allows GetFlightInfo -> DoGet of either the entire data set or just particular record batches. The server serializes data into the IPC file format (the random access format) to allow efficient retrieval of particular record batches.
I have a couple of questions about reading the data using RecordBatchFileReader.

---

Skimming through the underlying C++ implementation, I see that the RecordBatchFileReader caches 'some stuff' internally.

>> This makes me wonder: what is the recommended lifecycle for RecordBatchFileReader instances? Is it a good idea to keep a single instance of RecordBatchFileReader open as long as possible?

*My guess is yes, especially when working with files on S3, where this could cut down on network calls.*

---

The other thing is that when I try to send out all the data in response to DoGet, it seems the only way to do so is to create a GeneratorStream and pass it a generator that yields batches 0..num_record_batches from the RecordBatchFileReader.

>> Is the GeneratorStream the only way to stream out all data that is serialized in the IPC file format? Perhaps I missed something?

If the GeneratorStream is the right answer, my concern is that for every batch the C++ code will be calling 'up' into Python, grabbing the GIL on the way, getting the batch, and then going back to C++.

How about having a RecordBatchReader that reads batches 0..num_record_batches from the RecordBatchFileReader in a single pass? I have seen a couple of adapters into RecordBatchReader elsewhere in the codebase, so perhaps it's not totally off?

If the adapter were 'end-to-end' (a new RecordBatchReader implementation in C++, wrapped using Cython), then passing such a thing to `pyarrow.flight.RecordBatchStream` would mean all the sending can be done in C++. Plus, I think it makes some sense to have a common interface for reading all batches regardless of the format (stream vs. random access).

What do you think?

*Note: I have experimented with building this, but I think I hit a wall. The work in C++ seems straightforward (I have found a few other places where adapters to RecordBatchReader are done, so that was good inspiration).
However, I ran into problems in Cython because RecordBatchFileReader is only defined in ipc.pxi and does not have a `cdef class` block anywhere in the *.pxd files. And so, as far as my Cython experience goes, the extension cannot get hold of the RecordBatchFileReader's underlying C++ instance. I'm very new to Cython and Python/C++ extensions, so perhaps there is some other way? Of course I can still build MyOwnRecordBatchFileReader in Cython, but I would rather work with PyArrow types.*

Thank you and best regards,
Lubo Slivka
