Also, to clarify: I'm talking about the write side because it determines the characteristics your read path will have. I'm not talking about writing files each time you do a query; I'm talking about writing the files once, in such a way that when a query arrives you know you've already optimized for certain read patterns.
# ------------------------------
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene

On Wednesday, November 8th, 2023 at 10:27, Aldrin <[email protected]> wrote:

> Well, [1] shows a way to get the metadata, but you'll have to follow the
> function chain to figure out whether there's a way to get just the metadata
> for a RecordBatch without reading its data (I couldn't do it in ~5 min).
>
> > I forgot to mention that the subset of columns chosen is dynamic... It
> > wouldn't make sense to rewrite the files for each query.
>
> I'm just talking about writing the files to be less wide, and/or writing
> files that contain only the metadata and no actual data (schema and schema
> metadata) to initialize a RecordBatchStreamReader [2] from. Once you
> initialize a RecordBatchStreamReader, you can feed it binary data and the
> process looks like a reader with a pre-existing schema, but you're managing
> the file access (so you have to be more intentional in your file management).
>
> To Weston's point, if you have wide Feather files and processing the
> metadata (the schema, RecordBatch, or Feather file metadata; I'm not sure
> which one you're both referring to) is costly, then you probably need to
> change something in your process to get speed-ups.
>
> [1]: https://github.com/apache/arrow/blob/main/cpp/src/arrow/ipc/reader.cc#L867-L876
> [2]: https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc23RecordBatchStreamReaderE
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
> https://keybase.io/octalene
>
> On Wednesday, November 8th, 2023 at 09:45, Weston Pace <[email protected]> wrote:
>
> > You are correct that there is no existing capability to create an IPC
> > reader with precomputed metadata. I don't think anyone would be opposed to
> > this feature; it just hasn't been a priority.
> >
> > If you wanted to avoid changing Arrow, then you could create your own
> > implementation of `RandomAccessFile` which is partially backed by an
> > in-memory buffer and fetches from the file when reads go outside the
> > buffered range. However, I'm not sure that I/O is the culprit. Are you
> > reading from a local file? If so, then future reads would probably
> > already be cached by the OS (unless maybe you are under memory pressure).
> >
> > Perhaps it is the CPU cost of processing the metadata that is slowing down
> > your reads. If that is the case, then I think a code change is inevitable.
> >
> > On Wed, Nov 8, 2023 at 6:43 AM Ishbir Singh <[email protected]> wrote:
> >
> > > Thanks for the info, Aldrin. I forgot to mention that the subset of
> > > columns chosen is dynamic. Basically, I have a web server serving
> > > columns from the file. It wouldn't make sense to rewrite the files for
> > > each query.
> > >
> > > I'm just looking for the easiest way to read the metadata as a buffer
> > > so I can pass it to the function below, because I believe that should
> > > accomplish what I want.
> > >
> > > Thanks,
> > > Ishbir Singh
> > >
> > > On Tue, Nov 7, 2023 at 20:56, Aldrin <[email protected]> wrote:
> > >
> > > > In a probably too-short answer, I think you want to do one of the
> > > > following:
> > > >
> > > > - write a single Feather file with many batches
> > > > - write many Feather files, but use the dataset API so that Arrow
> > > >   hopefully does some multi-file optimization for you (and hopefully
> > > >   still keep multiple batches per file)
> > > > - write the schema to one file (or as few files as there are schemas)
> > > >   and write many (N) RecordBatches to fewer files (M) using the
> > > >   stream interface (instead of the file interface)
> > > >
> > > > I do the 3rd one, and I do it because I made assumptions about data
> > > > accesses that I have not validated.
> > > > The main assumption is that writing a RecordBatch with the stream
> > > > API does not rewrite the schema each time (or have equivalent
> > > > amplification on the read side).
> > > >
> > > > Let me know if there's any approach you want more info on and I can
> > > > follow up, or maybe someone else can chime in/correct me.
> > > >
> > > > Sent from Proton Mail for iOS
> > > >
> > > > On Tue, Nov 7, 2023 at 10:27, Ishbir Singh <[email protected]> wrote:
> > > >
> > > > > Apologies if this is the wrong place for this, but I'm looking to
> > > > > repeatedly select a subset of columns from a wide Feather file
> > > > > (which has ~200k columns). What I find is that if I use
> > > > > RecordBatchReader::Open with the requisite arguments asking it to
> > > > > select the particular columns, it reads the schema over and over
> > > > > (once per Open call). That is to be expected, as there doesn't seem
> > > > > to be a way to pass a pre-existing schema.
> > > > >
> > > > > However, in my use case, I want the smaller queries to be fast and
> > > > > can't have it re-parse the schema for every call. The input file
> > > > > thus has to be an io::RandomAccessFile. Looking at
> > > > > arrow/ipc/reader.h, the only method that can serve this purpose
> > > > > seems to be:
> > > > >
> > > > > Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(
> > > > >     const Buffer& metadata, const std::shared_ptr<Schema>& schema,
> > > > >     const DictionaryMemo* dictionary_memo,
> > > > >     const IpcReadOptions& options, io::RandomAccessFile* file);
> > > > >
> > > > > How do I efficiently read the file once to get the schema and
> > > > > metadata in this case? My file does not have any dictionaries. Am I
> > > > > thinking about this incorrectly?
> > > > >
> > > > > Would appreciate any pointers.
> > > > >
> > > > > Thanks,
> > > > > Ishbir Singh
