You are correct that there is no existing capability to create an IPC
reader with precomputed metadata.  I don't think anyone would be opposed to
this feature; it just hasn't been a priority.

If you wanted to avoid changing Arrow, you could create your own
implementation of `RandomAccessFile` that is partially backed by an
in-memory buffer and only goes to the file when a read falls outside the
buffered range (rough sketch below).  However, I'm not sure that I/O is the
culprit.  Are you reading from a local file?  If so, the subsequent reads
would probably already be served from the OS page cache (unless maybe you
are under memory pressure).
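
If you do go the wrapper route, here is a rough, untested sketch of what it
could look like.  The class name and the caching scheme are just my own
illustration (only the arrow::io::RandomAccessFile overrides are real API);
the idea is that you fetch the footer/schema bytes once, hold them in a
Buffer, and serve any read that falls inside that range from memory:

#include <memory>
#include <utility>

#include <arrow/api.h>
#include <arrow/io/api.h>

// Hypothetical wrapper: serves reads inside a cached byte range from memory
// and delegates everything else to the underlying file.
class CachedRangeFile : public arrow::io::RandomAccessFile {
 public:
  CachedRangeFile(std::shared_ptr<arrow::io::RandomAccessFile> wrapped,
                  int64_t cache_offset, std::shared_ptr<arrow::Buffer> cache)
      : wrapped_(std::move(wrapped)),
        cache_offset_(cache_offset),
        cache_(std::move(cache)) {}

  // Serve ReadAt from the in-memory buffer when the requested range lies
  // entirely inside the cached region; otherwise fall back to the real file.
  arrow::Result<std::shared_ptr<arrow::Buffer>> ReadAt(
      int64_t position, int64_t nbytes) override {
    if (position >= cache_offset_ &&
        position + nbytes <= cache_offset_ + cache_->size()) {
      return arrow::SliceBuffer(cache_, position - cache_offset_, nbytes);
    }
    return wrapped_->ReadAt(position, nbytes);
  }

  // Everything else just delegates to the wrapped file.
  arrow::Result<int64_t> ReadAt(int64_t position, int64_t nbytes,
                                void* out) override {
    return wrapped_->ReadAt(position, nbytes, out);
  }
  arrow::Result<int64_t> Read(int64_t nbytes, void* out) override {
    return wrapped_->Read(nbytes, out);
  }
  arrow::Result<std::shared_ptr<arrow::Buffer>> Read(int64_t nbytes) override {
    return wrapped_->Read(nbytes);
  }
  arrow::Result<int64_t> GetSize() override { return wrapped_->GetSize(); }
  arrow::Status Seek(int64_t position) override { return wrapped_->Seek(position); }
  arrow::Result<int64_t> Tell() const override { return wrapped_->Tell(); }
  arrow::Status Close() override { return wrapped_->Close(); }
  bool closed() const override { return wrapped_->closed(); }

 private:
  std::shared_ptr<arrow::io::RandomAccessFile> wrapped_;
  int64_t cache_offset_;
  std::shared_ptr<arrow::Buffer> cache_;
};

You would construct one of these per file (with the footer/metadata bytes
you read up front) and pass it wherever Arrow expects a RandomAccessFile, so
repeated Open calls hit memory instead of the file for that range.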

Perhaps it is the CPU cost of processing the metadata that is slowing down
your reads.  If that is the case then I think a code change is inevitable.


On Wed, Nov 8, 2023 at 6:43 AM Ishbir Singh <[email protected]> wrote:

> Thanks for the info, Aldrin. I forgot to mention that the subset of
> columns chosen is dynamic. Basically, I have a web server serving columns
> from the file. It wouldn’t make sense to rewrite the files for each query.
>
> I’m just looking for the easiest way to read the metadata as a buffer so I
> can pass it to the function below because I believe that should accomplish
> what I want.
>
> Thanks,
> Ishbir Singh
>
> On Tue, Nov 7, 2023 at 20:56, Aldrin <[email protected]> wrote:
>
>> In a probably too short answer, I think you want to do one of the
>> following:
>>
>> - write a single Feather file with many batches
>> - write many Feather files, but use the dataset API so that Arrow can
>> hopefully do some multi-file optimization for you (and hopefully still
>> have multiple batches per file)
>> - write the schema in one file (or in as few files as there are schemas)
>> and write many (N) record batches to fewer files (M) using the stream
>> interface (instead of the file format)
>>
>> I do the third one, because I made assumptions about data access
>> patterns that I have not validated. The main assumption is that writing a
>> RecordBatch with the stream API does not rewrite the schema each time (or
>> incur equivalent amplification on the read side).
>>
>> Let me know if there's any approach you want more info on and I can
>> follow up or maybe someone else can chime in/correct me.
>>
>> On Tue, Nov 7, 2023 at 10:27, Ishbir Singh <[email protected]> wrote:
>>
>> Apologies if this is the wrong place for this, but I'm looking to
>> repeatedly select a subset of columns from a wide feather file (which has
>> ~200k columns). What I find is that if I use RecordBatchReader::Open with
>> the requisite arguments asking it to select the particular columns, it
>> reads the schema over and over (once per Open call). Now that is to be
>> expected as there doesn't seem to be a way to pass a pre-existing schema.
>>
>> However, in my use case, I want the smaller queries to be fast and can't
>> have it re-parse the schema for every call. The input file thus has to be an
>> io::RandomAccessFile. Looking at arrow/ipc/reader.h, the only method that
>> can serve this purpose seems to be:
>>
>> Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(
>>     const Buffer& metadata, const std::shared_ptr<Schema>& schema,
>>     const DictionaryMemo* dictionary_memo, const IpcReadOptions& options,
>>     io::RandomAccessFile* file);
>>
>> How do I efficiently read the file once to get the schema and metadata in
>> this case? My file does not have any dictionaries. Am I thinking about this
>> incorrectly?
>>
>> Would appreciate any pointers.
>>
>> Thanks,
>> Ishbir Singh
>>
>>
