Yes, please file a Jira; it would be good to have a conversion for this use case.
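For anyone landing on this thread later: until such a conversion exists, the GeneratorStream approach discussed below is the workaround. A minimal sketch, assuming a server class, file path, and setup of my own invention (not code from this thread):

```python
import pyarrow as pa
import pyarrow.flight as flight


class FileBackedFlightServer(flight.FlightServerBase):
    """Sketch: serve every batch of a random-access IPC file via DoGet."""

    def __init__(self, location, ipc_path, **kwargs):
        super().__init__(location, **kwargs)
        # Keep one reader open rather than re-opening per request; opening
        # reads the footer/schema up front, which matters more on S3.
        self._reader = pa.ipc.open_file(ipc_path)

    def do_get(self, context, ticket):
        reader = self._reader

        def batch_generator():
            # Each yield crosses from C++ back into Python (and takes the
            # GIL), which is the per-batch overhead discussed in this thread.
            for i in range(reader.num_record_batches):
                yield reader.get_batch(i)

        return flight.GeneratorStream(reader.schema, batch_generator())
```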
On Fri, Mar 18, 2022, at 03:43, Lubomir Slivka wrote:
> Hi David,
>
> Thanks for the confirmation on the file reader instance lifecycle.
>
> Indeed, the issue is that RecordBatchFileReader is a distinct interface from
> RecordBatchReader - which IMHO is understandable, as the file reader can do
> repeated, random access while RecordBatchReader is about getting batches in
> order, one by one, once. I guess what is missing is a method like
> RecordBatchFileReader.to_reader() that would return a RecordBatchReader to
> read all batches from the underlying instance once.
>
> *For the sake of completeness: before I got the requirement to serve random
> batches of data, the server was storing data in IPC stream format and then
> used pyarrow.flight.RecordBatchStream with great success.*
>
> How to proceed from here? Shall I open a JIRA ticket, or..?
>
> Thanks again,
> Lubo
>
>
> On Thu, Mar 17, 2022 at 11:22 PM David Li <[email protected]> wrote:
>> Hi Lubo,
>>
>> Yes, opening a file (whether locally, on S3, or elsewhere) does have to do
>> some work up front to read the file footer, load the schema, etc. before any
>> batches can be read. (See ARROW-15340 for an optimization that would cut
>> down on the number of I/O operations here.) It should generally be minor
>> overhead (and as always, profile/benchmark your particular use case), but
>> it's best not to repeatedly close/re-open the reader if you can help it.
>>
>> For streaming data from a record batch reader in DoGet, use
>> RecordBatchStream:
>> https://arrow.apache.org/docs/python/generated/pyarrow.flight.RecordBatchStream.html#pyarrow.flight.RecordBatchStream
>> So long as the record batch reader itself is not implemented in Python,
>> this will bypass the GIL/will not call back into Python for each batch. (It
>> will extract the underlying C++ RecordBatchReader object from the Python
>> reader and use that directly.)
>>
>> Ah, though: is the issue that RecordBatchFileReader is a distinct interface
>> from RecordBatchReader? We should indeed fix that…
>>
>> -David
>>
>> On Thu, Mar 17, 2022, at 17:44, Lubomir Slivka wrote:
>>> Hi All,
>>>
>>> I'm building a server with Flight RPC to store data and then allow
>>> GetFlightInfo->DoGet of either the entire data or just particular record
>>> batches. The server serializes data into the IPC file format (the
>>> random-access format) to allow efficient retrieval of particular record
>>> batches.
>>>
>>> I have a couple of questions about reading the data using
>>> RecordBatchFileReader.
>>>
>>> ---
>>>
>>> Skimming through the underlying C++ implementation, I see that the
>>> RecordBatchFileReader caches 'some stuff' internally.
>>>
>>> >> This makes me wonder, what is the recommendation for the lifecycle of
>>> >> RecordBatchFileReader instances? Is it a good idea to keep a single
>>> >> instance of RecordBatchFileReader open as long as possible?
>>>
>>> *My guess is yes; especially when working with files on S3, this could cut
>>> down on some network calls.*
>>>
>>> ---
>>>
>>> Now the other thing is... when I'm trying to send out all data as a
>>> response to DoGet, it seems that the only way to do so is to create a
>>> GeneratorStream and pass a generator that yields batches
>>> 0..num_record_batches from the RecordBatchFileReader.
>>>
>>> >> Is the GeneratorStream the only way to stream out all data that is
>>> >> serialized in the IPC file format? Perhaps I missed something?
>>>
>>> If the GeneratorStream is the right answer, my concern here is that, for
>>> every batch, the C++ code will be calling 'up' to Python, grabbing the GIL
>>> on the way, getting the batch, and then going back to C++.
>>>
>>> How about having a RecordBatchReader that would do a one-time read of
>>> batches 0..num_record_batches from a RecordBatchFileReader? I have seen a
>>> couple of adapters into RecordBatchReader elsewhere in the codebase, so
>>> perhaps it's not totally off? If the adapter were 'end-to-end' (a new
>>> RecordBatchReader implementation in C++, wrapped using Cython), then
>>> passing such a thing to `pyarrow.flight.RecordBatchStream` would mean all
>>> the sending could be done in C++. Plus, I think it makes some sense to
>>> have a common interface to read all batches regardless of the format
>>> (stream vs. random access). What do you think?
>>>
>>> *Note: I have experimented with building this, but I think I hit a wall.
>>> The work in C++ seems straightforward (I found a few other places where
>>> adapters to RecordBatchReader are done, so that was good inspiration).
>>> However, I ran into problems in Cython because RecordBatchFileReader is
>>> only defined in ipc.pxi and does not have a `cdef class` block anywhere in
>>> the *.pxd files. And so - as far as my Cython experience goes - the
>>> extension cannot get hold of the RecordBatchFileReader's underlying C++
>>> instance. I'm very new to Cython and Python/C++ extensions, so perhaps
>>> there is some other way? Of course, I can still build
>>> MyOwnRecordBatchFileReader in Cython, but I would rather play with PyArrow
>>> types.*
>>>
>>> Thank you and best regards,
>>> Lubo Slivka
>>>
>>
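For completeness, the end state proposed in this thread would look roughly like the sketch below from the Python side. Note that `to_reader()` is the hypothetical conversion method suggested above (it does not exist in PyArrow at the time of this thread, hence the Jira), and the file path is illustrative:

```python
import pyarrow as pa
import pyarrow.flight as flight


class FileBackedFlightServer(flight.FlightServerBase):
    def do_get(self, context, ticket):
        file_reader = pa.ipc.open_file("/data/table.arrow")  # illustrative path

        # Hypothetical method proposed in this thread: convert the random-access
        # file reader into a plain RecordBatchReader that walks batches
        # 0..num_record_batches exactly once. (Not in PyArrow yet.)
        stream_reader = file_reader.to_reader()

        # RecordBatchStream pulls from the underlying C++ RecordBatchReader
        # directly, so there is no per-batch call back into Python and no GIL
        # acquisition per batch.
        return flight.RecordBatchStream(stream_reader)
```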
