Also, I should clarify: I'm talking about the write side because it determines 
the characteristics your read path will have. I'm not talking about writing 
files each time you run a query; I'm talking about writing the files once, in 
a way that you know is optimized for certain read patterns when a query 
comes in.



# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene

https://keybase.io/octalene


On Wednesday, November 8th, 2023 at 10:27, Aldrin <[email protected]> wrote:


> well, [1] shows a way to get the metadata, but you'll have to follow the 
> function chain to figure out if there's a way to just get the metadata for a 
> RecordBatch without reading the data for it (I couldn't do it in ~5 min).
> 

> 

> -   I forgot to mention that the subset of columns chosen is dynamic... It 
> wouldn’t make sense to rewrite the files for each query.
>     

> 

> 

> I'm just talking about writing the files to be less wide, and/or writing 
> files that contain only the metadata and no actual data (the schema and 
> schema metadata) to initialize a RecordBatchStreamReader [2] from. Once you 
> initialize a RecordBatchStreamReader, you can feed it binary data; the 
> process looks like a reader with a pre-existing schema, but you're managing 
> the file access yourself (so you have to be more intentional about your 
> file management).
> 

> To Weston's point: if you have wide Feather files and processing the 
> metadata (the schema, the record batch metadata, or the Feather file 
> metadata; I'm not sure which one you're both referring to) is costly, then 
> you probably need to change something in your process to get speed-ups.
> 

> 

> [1]: 
> https://github.com/apache/arrow/blob/main/cpp/src/arrow/ipc/reader.cc#L867-L876
> [2]: 
> https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc23RecordBatchStreamReaderE
> 

> 

> 

> # ------------------------------
> 

> # Aldrin
> 

> 

> https://github.com/drin/
> 

> https://gitlab.com/octalene
> 

> https://keybase.io/octalene
> 

> 

> On Wednesday, November 8th, 2023 at 09:45, Weston Pace 
> <[email protected]> wrote:
> 

> 

> > You are correct that there is no existing capability to create an IPC 
> > reader with precomputed metadata. I don't think anyone would be opposed to 
> > this feature, it just hasn't been a priority.
> > 

> > If you wanted to avoid changing Arrow, then you could create your own 
> > implementation of `RandomAccessFile` which is partially backed by an 
> > in-memory buffer and fetches from the file when reads go outside the 
> > buffered range. However, I'm not sure that I/O is the culprit. Are you 
> > reading from a local file? If so, then future reads would probably 
> > already be cached by the OS (unless perhaps you are under memory pressure).
> > 

> > Perhaps it is the CPU cost of processing the metadata that is slowing down 
> > your reads. If that is the case then I think a code change is inevitable.
> > 

> > 

> > On Wed, Nov 8, 2023 at 6:43 AM Ishbir Singh <[email protected]> wrote:
> > 

> > > Thanks for the info, Aldrin. I forgot to mention that the subset of 
> > > columns chosen is dynamic. Basically, I have a web server serving columns 
> > > from the file. It wouldn’t make sense to rewrite the files for each query.
> > > 

> > > I’m just looking for the easiest way to read the metadata as a buffer so 
> > > I can pass it to the function below because I believe that should 
> > > accomplish what I want.
> > > 

> > > Thanks,
> > > Ishbir Singh
> > > 

> > > On Tue, Nov 7, 2023 at 20:56 Aldrin <[email protected]> wrote:
> > > 

> > > > In a probably too short answer, I think you want to do one of the 
> > > > following:
> > > > 

> > > > - write a single Feather file with many batches
> > > > - write many Feather files, but use the dataset API so that Arrow can 
> > > > hopefully do some multi-file optimization for you (and hopefully still 
> > > > have multiple batches per file)
> > > > - write the schema to one file (or as few files as there are schemas) 
> > > > and write many (N) record batches to fewer files (M) using the stream 
> > > > interface (instead of the file interface)
> > > > 

> > > > I do the third one, because I made assumptions about data access 
> > > > patterns, but I have not validated those assumptions. The main 
> > > > assumption is that writing a RecordBatch with the stream API does not 
> > > > rewrite the schema each time (and does not have equivalent 
> > > > amplification on the read side).
> > > > 

> > > > Let me know if there's any approach you want more info on and I can 
> > > > follow up or maybe someone else can chime in/correct me.
> > > > 

> > > > Sent from Proton Mail for iOS
> > > > 

> > > > 

> > > > On Tue, Nov 7, 2023 at 10:27, Ishbir Singh <[email protected]> wrote:
> > > > 

> > > > > Apologies if this is the wrong place for this, but I'm looking to 
> > > > > repeatedly select a subset of columns from a wide feather file (which 
> > > > > has ~200k columns). What I find is that if I use 
> > > > > RecordBatchReader::Open with the requisite arguments asking it to 
> > > > > select the particular columns, it reads the schema over and over 
> > > > > (once per Open call). Now that is to be expected as there doesn't 
> > > > > seem to be a way to pass a pre-existing schema.
> > > > > 

> > > > > 

> > > > > However, in my use case, I want the smaller queries to be fast and 
> > > > > can't have it re-parse the schema for every call. The input file thus 
> > > > > has to be an io::RandomAccessFile. Looking at arrow/ipc/reader.h, the 
> > > > > only method that can serve this purpose seems to be:
> > > > > 

> > > > > 

> > > > > Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(
> > > > >     const Buffer& metadata, const std::shared_ptr<Schema>& schema,
> > > > >     const DictionaryMemo* dictionary_memo, const IpcReadOptions& options,
> > > > >     io::RandomAccessFile* file);
> > > > > 

> > > > > 

> > > > > How do I efficiently read the file once to get the schema and 
> > > > > metadata in this case? My file does not have any dictionaries. Am I 
> > > > > thinking about this incorrectly?
> > > > > 

> > > > > 

> > > > > Would appreciate any pointers.
> > > > > 

> > > > > 

> > > > > Thanks,
> > > > > Ishbir Singh
