Yes, please file a Jira; it would be good to have a conversion for this use case.
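For anyone landing on this thread later: until such a conversion exists, the GeneratorStream approach discussed below is the workaround. A minimal sketch, assuming a server class, file path, and setup of my own invention (not code from this thread):

```python
import pyarrow as pa
import pyarrow.flight as flight


class FileBackedFlightServer(flight.FlightServerBase):
    """Sketch: serve every batch of a random-access IPC file via DoGet."""

    def __init__(self, location, ipc_path, **kwargs):
        super().__init__(location, **kwargs)
        # Keep one reader open rather than re-opening per request; opening
        # reads the footer/schema up front, which matters more on S3.
        self._reader = pa.ipc.open_file(ipc_path)

    def do_get(self, context, ticket):
        reader = self._reader

        def batch_generator():
            # Each yield crosses from C++ back into Python (and takes the
            # GIL), which is the per-batch overhead discussed in this thread.
            for i in range(reader.num_record_batches):
                yield reader.get_batch(i)

        return flight.GeneratorStream(reader.schema, batch_generator())
```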
On Fri, Mar 18, 2022, at 03:43, Lubomir Slivka wrote:
> Hi David,
>
> Thanks for the confirmation on the file reader instance lifecycle.
>
> Indeed, the issue is that RecordBatchFileReader is a distinct interface from
> RecordBatchReader - which IMHO is understandable, as the file reader can do
> repeated, random access while RecordBatchReader is about getting batches in
> order, one by one, once. I guess what is missing is a method like
> RecordBatchFileReader.to_reader() that would return a RecordBatchReader to
> read all batches from the underlying instance once.
>
> *For the sake of completeness: before I got the requirement to serve random
> batches of data, the server was storing data in IPC stream format and then
> used pyarrow.flight.RecordBatchStream with great success.*
>
> How to proceed from here? Shall I open a JIRA ticket, or..?
>
> Thanks again,
> Lubo
>
>
> On Thu, Mar 17, 2022 at 11:22 PM David Li <[email protected]> wrote:
>> Hi Lubo,
>>
>> Yes, opening a file (whether locally, on S3, or elsewhere) does have to do
>> some work up front to read the file footer, load the schema, etc. before any
>> batches can be read. (See ARROW-15340 for an optimization that would cut
>> down on the number of I/O operations here.) It should generally be minor
>> overhead (and as always, profile/benchmark your particular use case), but
>> it's best not to repeatedly close/re-open the reader if you can help it.
>>
>> For streaming data from a record batch reader in DoGet, use
>> RecordBatchStream:
>> https://arrow.apache.org/docs/python/generated/pyarrow.flight.RecordBatchStream.html#pyarrow.flight.RecordBatchStream
>> So long as the record batch reader itself is not implemented in Python,
>> this will bypass the GIL/will not call back into Python for each batch. (It
>> will extract the underlying C++ RecordBatchReader object from the Python
>> reader and use that directly.)
>>
>> Ah, though: is the issue that RecordBatchFileReader is a distinct interface
>> from RecordBatchReader? We should indeed fix that…
>>
>> -David
>>
>> On Thu, Mar 17, 2022, at 17:44, Lubomir Slivka wrote:
>>> Hi All,
>>>
>>> I'm building a server with Flight RPC to store data and then allow
>>> GetFlightInfo->DoGet of either the entire data or just particular record
>>> batches. The server serializes data into the IPC file format (the
>>> random-access format) to allow efficient retrieval of particular record
>>> batches.
>>>
>>> I have a couple of questions about reading the data using
>>> RecordBatchFileReader.
>>>
>>> ---
>>>
>>> Skimming through the underlying C++ implementation, I see that the
>>> RecordBatchFileReader caches 'some stuff' internally.
>>>
>>> >> This makes me wonder, what is the recommendation for the lifecycle of
>>> >> RecordBatchFileReader instances? Is it a good idea to keep a single
>>> >> instance of RecordBatchFileReader open as long as possible?
>>>
>>> *My guess is yes; especially when working with files on S3, this could cut
>>> down on some network calls.*
>>>
>>> ---
>>>
>>> Now the other thing is... when I'm trying to send out all data as a
>>> response to DoGet, it seems that the only way to do so is to create a
>>> GeneratorStream and pass a generator that yields batches
>>> 0..num_record_batches from the RecordBatchFileReader.
>>>
>>> >> Is the GeneratorStream the only way to stream out all data that is
>>> >> serialized in the IPC file format? Perhaps I missed something?
>>>
>>> If the GeneratorStream is the right answer, my concern here is that, for
>>> every batch, the C++ code will be calling 'up' to Python, grabbing the GIL
>>> on the way, getting the batch, and then going back to C++.
>>>
>>> How about having a RecordBatchReader that would do a one-time read of
>>> batches 0..num_record_batches from a RecordBatchFileReader? I have seen a
>>> couple of adapters into RecordBatchReader elsewhere in the codebase, so
>>> perhaps it's not totally off? If the adapter were 'end-to-end' (a new
>>> RecordBatchReader implementation in C++, wrapped using Cython), then
>>> passing such a thing to `pyarrow.flight.RecordBatchStream` would mean all
>>> the sending could be done in C++. Plus, I think it makes some sense to
>>> have a common interface to read all batches regardless of the format
>>> (stream vs. random access). What do you think?
>>>
>>> *Note: I have experimented with building this, but I think I hit a wall.
>>> The work in C++ seems straightforward (I found a few other places where
>>> adapters to RecordBatchReader are done, so that was good inspiration).
>>> However, I ran into problems in Cython because RecordBatchFileReader is
>>> only defined in ipc.pxi and does not have a `cdef class` block anywhere in
>>> the *.pxd files. And so - as far as my Cython experience goes - the
>>> extension cannot get hold of the RecordBatchFileReader's underlying C++
>>> instance. I'm very new to Cython and Python/C++ extensions, so perhaps
>>> there is some other way? Of course, I can still build
>>> MyOwnRecordBatchFileReader in Cython, but I would rather play with PyArrow
>>> types.*
>>>
>>> Thank you and best regards,
>>> Lubo Slivka
>>>
>>
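For completeness, the end state proposed in this thread would look roughly like the sketch below from the Python side. Note that `to_reader()` is the hypothetical conversion method suggested above (it does not exist in PyArrow at the time of this thread, hence the Jira), and the file path is illustrative:

```python
import pyarrow as pa
import pyarrow.flight as flight


class FileBackedFlightServer(flight.FlightServerBase):
    def do_get(self, context, ticket):
        file_reader = pa.ipc.open_file("/data/table.arrow")  # illustrative path

        # Hypothetical method proposed in this thread: convert the random-access
        # file reader into a plain RecordBatchReader that walks batches
        # 0..num_record_batches exactly once. (Not in PyArrow yet.)
        stream_reader = file_reader.to_reader()

        # RecordBatchStream pulls from the underlying C++ RecordBatchReader
        # directly, so there is no per-batch call back into Python and no GIL
        # acquisition per batch.
        return flight.RecordBatchStream(stream_reader)
```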
