Thanks for the info, Aldrin. I forgot to mention that the subset of columns chosen is dynamic. Basically, I have a web server serving columns from the file, so it wouldn't make sense to rewrite the files for each query.
I'm just looking for the easiest way to read the metadata as a buffer so I can pass it to the function below, because I believe that should accomplish what I want. (I've put a rough sketch of what I'm imagining below the quoted thread.)

Thanks,
Ishbir Singh

On Tue, Nov 7, 2023 at 20:56, Aldrin <[email protected]> wrote:

> In a probably too short answer, I think you want to do one of the
> following:
>
> - write a single feather file with many batches
> - write many feather files but use the dataset API to hopefully have
> arrow do some multi-file optimization for you (and hopefully still have
> multiple batches per file)
> - write the schema in one file (or as few files as there are schemas) and
> write many (N) recordbatches to fewer files (M) using the stream interface
> (instead of file)
>
> I do the 3rd one, and I do it because I made assumptions about data
> accesses that I have not validated. The main assumption is that writing a
> RecordBatch with the stream API does not rewrite the schema each time (or
> have equivalent amplification on the read side).
>
> Let me know if there's any approach you want more info on and I can follow
> up, or maybe someone else can chime in/correct me.
>
> Sent from Proton Mail <https://proton.me/mail/home> for iOS
>
> On Tue, Nov 7, 2023 at 10:27, Ishbir Singh <[email protected]> wrote:
>
> Apologies if this is the wrong place for this, but I'm looking to
> repeatedly select a subset of columns from a wide feather file (which has
> ~200k columns). What I find is that if I use RecordBatchReader::Open with
> the requisite arguments asking it to select the particular columns, it
> reads the schema over and over (once per Open call). That is to be
> expected, as there doesn't seem to be a way to pass a pre-existing schema.
>
> However, in my use case, I want the smaller queries to be fast and can't
> have it re-parse the schema for every call. The input file thus has to be
> an io::RandomAccessFile. Looking at arrow/ipc/reader.h, the only method
> that can serve this purpose seems to be:
>
> Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(
>     const Buffer& metadata, const std::shared_ptr<Schema>& schema,
>     const DictionaryMemo* dictionary_memo, const IpcReadOptions& options,
>     io::RandomAccessFile* file);
>
> How do I efficiently read the file once to get the schema and metadata in
> this case? My file does not have any dictionaries. Am I thinking about
> this incorrectly?
>
> Would appreciate any pointers.
>
> Thanks,
> Ishbir Singh
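
P.S. For concreteness, here is a rough, untested sketch of the caching I'm imagining, assuming the batches are written with the stream interface (along the lines of your third option) rather than as a single feather file. The file name and column indices are placeholders, and I'm assuming (unverified) that the Buffer-plus-file ReadRecordBatch overload reads the batch body starting at the file's current position:

// Untested sketch; "wide_table.arrows" and the column indices are placeholders.
#include <iostream>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/message.h>
#include <arrow/ipc/reader.h>

arrow::Status ServeColumns() {
  ARROW_ASSIGN_OR_RAISE(auto file,
      arrow::io::ReadableFile::Open("wide_table.arrows"));

  // One-time setup: parse the schema message once, then read the first
  // record batch message and keep only its metadata buffer and body offset.
  arrow::ipc::DictionaryMemo dictionary_memo;
  ARROW_ASSIGN_OR_RAISE(auto schema,
      arrow::ipc::ReadSchema(file.get(), &dictionary_memo));

  ARROW_ASSIGN_OR_RAISE(auto batch_msg, arrow::ipc::ReadMessage(file.get()));
  std::shared_ptr<arrow::Buffer> metadata = batch_msg->metadata();
  ARROW_ASSIGN_OR_RAISE(int64_t end_of_msg, file->Tell());
  int64_t body_start = end_of_msg - batch_msg->body_length();

  // Per query: choose a different column subset without re-parsing the schema.
  auto options = arrow::ipc::IpcReadOptions::Defaults();
  options.included_fields = {0, 17, 42};  // placeholder column indices

  // Assumption (unverified): the batch body is read from the file's current
  // position, so seek back to where the body starts before each read.
  ARROW_RETURN_NOT_OK(file->Seek(body_start));
  ARROW_ASSIGN_OR_RAISE(auto batch,
      arrow::ipc::ReadRecordBatch(*metadata, schema, &dictionary_memo,
                                  options, file.get()));

  std::cout << "read batch with " << batch->num_columns() << " columns" << std::endl;
  return arrow::Status::OK();
}

Does that look like a reasonable direction, or am I still missing an easier way to get the metadata buffer?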
