Thanks for the info, Aldrin. I forgot to mention that the subset of columns chosen is dynamic. Basically, I have a web server serving columns from the file, so it wouldn't make sense to rewrite the files for each query.
I'm just looking for the easiest way to read the metadata as a buffer so I can pass it to the function below, because I believe that should accomplish what I want. (I've put a rough sketch of what I'm imagining below the quoted thread.)

Thanks,
Ishbir Singh

On Tue, Nov 7, 2023 at 20:56, Aldrin <[email protected]> wrote:

> In a probably too short answer, I think you want to do one of the
> following:
>
> - write a single feather file with many batches
> - write many feather files but use the dataset API to hopefully have
> arrow do some multi-file optimization for you (and hopefully still have
> multiple batches per file)
> - write the schema in one file (or as few files as there are schemas) and
> write many (N) recordbatches to fewer files (M) using the stream interface
> (instead of file)
>
> I do the 3rd one, and I do it because I made assumptions about data
> accesses that I have not validated. The main assumption is that writing a
> RecordBatch with the stream API does not rewrite the schema each time (or
> have equivalent amplification on the read side).
>
> Let me know if there's any approach you want more info on and I can follow
> up, or maybe someone else can chime in/correct me.
>
> Sent from Proton Mail <https://proton.me/mail/home> for iOS
>
> On Tue, Nov 7, 2023 at 10:27, Ishbir Singh <[email protected]> wrote:
>
> Apologies if this is the wrong place for this, but I'm looking to
> repeatedly select a subset of columns from a wide feather file (which has
> ~200k columns). What I find is that if I use RecordBatchReader::Open with
> the requisite arguments asking it to select the particular columns, it
> reads the schema over and over (once per Open call). That is to be
> expected, as there doesn't seem to be a way to pass a pre-existing schema.
>
> However, in my use case, I want the smaller queries to be fast and can't
> have it re-parse the schema for every call. The input file thus has to be
> an io::RandomAccessFile. Looking at arrow/ipc/reader.h, the only method
> that can serve this purpose seems to be:
>
> Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(
>     const Buffer& metadata, const std::shared_ptr<Schema>& schema,
>     const DictionaryMemo* dictionary_memo, const IpcReadOptions& options,
>     io::RandomAccessFile* file);
>
> How do I efficiently read the file once to get the schema and metadata in
> this case? My file does not have any dictionaries. Am I thinking about
> this incorrectly?
>
> Would appreciate any pointers.
>
> Thanks,
> Ishbir Singh
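
P.S. For concreteness, here is a rough, untested sketch of the caching I'm imagining, assuming the batches are written with the stream interface (along the lines of your third option) rather than as a single feather file. The file name and column indices are placeholders, and I'm assuming (unverified) that the Buffer-plus-file ReadRecordBatch overload reads the batch body starting at the file's current position:

// Untested sketch; "wide_table.arrows" and the column indices are placeholders.
#include <iostream>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/message.h>
#include <arrow/ipc/reader.h>

arrow::Status ServeColumns() {
  ARROW_ASSIGN_OR_RAISE(auto file,
      arrow::io::ReadableFile::Open("wide_table.arrows"));

  // One-time setup: parse the schema message once, then read the first
  // record batch message and keep only its metadata buffer and body offset.
  arrow::ipc::DictionaryMemo dictionary_memo;
  ARROW_ASSIGN_OR_RAISE(auto schema,
      arrow::ipc::ReadSchema(file.get(), &dictionary_memo));

  ARROW_ASSIGN_OR_RAISE(auto batch_msg, arrow::ipc::ReadMessage(file.get()));
  std::shared_ptr<arrow::Buffer> metadata = batch_msg->metadata();
  ARROW_ASSIGN_OR_RAISE(int64_t end_of_msg, file->Tell());
  int64_t body_start = end_of_msg - batch_msg->body_length();

  // Per query: choose a different column subset without re-parsing the schema.
  auto options = arrow::ipc::IpcReadOptions::Defaults();
  options.included_fields = {0, 17, 42};  // placeholder column indices

  // Assumption (unverified): the batch body is read from the file's current
  // position, so seek back to where the body starts before each read.
  ARROW_RETURN_NOT_OK(file->Seek(body_start));
  ARROW_ASSIGN_OR_RAISE(auto batch,
      arrow::ipc::ReadRecordBatch(*metadata, schema, &dictionary_memo,
                                  options, file.get()));

  std::cout << "read batch with " << batch->num_columns() << " columns" << std::endl;
  return arrow::Status::OK();
}

Does that look like a reasonable direction, or am I still missing an easier way to get the metadata buffer?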
