This worked out well. I’m able to see multiple `ReadAt` calls happening concurrently, and of course the reader waits for these calls to complete. The overall end-to-end latency of the job is much lower now.
For anyone else who wants it, the rough sequence to create a parquet reader with parallelized column reads is below (sorry if it is verbose; a fuller compilable sketch follows after the quoted thread).

1. auto fp = some_filesystem->OpenInputFile(filename)
2. auto pqr = parquet::ParquetFileReader::Open(fp)
3. unique_ptr<parquet::arrow::FileReader> reader;
4. parquet::ArrowReaderProperties props;
5. props.set_pre_buffer(true);
6. auto status = parquet::arrow::FileReader::Make(arrow::default_memory_pool(), std::move(pqr), props, &reader)
7. reader->set_use_threads(true)

Thank you.

> On Mar 16, 2021, at 5:05 PM, Weston Pace <[email protected]> wrote:
>
> The parquet::arrow::FileReader class takes in
> parquet::ArrowReaderProperties, which has a use_threads option. If
> true, the reader will parallelize column reads. This flag is used
> in parquet/arrow/reader.cc to parallelize column reads (search for
> OptionalParallelFor).
>
> This may or may not trigger the actual reading. If prebuffering is
> off, then there is a NextBatch call which will trigger the read needed
> for the column.
>
> However, if prebuffering is on (best for performance), then an attempt
> will be made to combine reads following the rules in
> src/arrow/io/caching.h. This might be a fun place to do some
> experiments if you'd like. In src/arrow/io/caching.cc you will see
> actual calls to arrow::io::RandomAccessFile::ReadAsync. Keep in mind
> this is all with regard to the latest commits; some work has been
> done here since 3.0.
>
> Also keep in mind that the use_threads flag is forced off when reading
> multiple files as part of a dataset scan. This happens in
> arrow::dataset::MakeArrowReaderProperties inside of file_parquet.cc.
> I am currently working on ARROW-7001, which will allow us to keep the
> parallel reads. That JIRA issue explains the problems involved.
>
> I'm happy to provide more information if you'd like, but I hope this
> gets you started.
>
> On Tue, Mar 16, 2021 at 8:00 AM Yeshwanth Sriram <[email protected]>
> wrote:
>>
>> Hello,
>>
>> I’ve managed to implement an ADLFS/gen2 filesystem with readers/writers. I’m
>> also able to read data from ADLFS via the parquet reader using my
>> implementation. It is modeled after the s3fs implementation.
>>
>> Questions:
>> - Is there a way to parallelize the column read operation using multiple
>> threads in parquet/reader?
>> - Can someone point me to the code in the parquet subsystem where the final
>> call is dispatched to the underlying random access file object?
>>
>> Thank you
>> Yesh
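As promised above, here is a minimal compilable sketch of that sequence, assuming Arrow 3.0-era APIs. The `some_filesystem`, `filename`, and `ReadWholeFile` names are placeholders standing in for your own ADLFS/gen2 filesystem, path, and calling code, not anything provided by Arrow itself.

#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/filesystem/filesystem.h>
#include <parquet/arrow/reader.h>
#include <parquet/file_reader.h>

// Open `filename` through a custom filesystem and read it into a Table,
// with pre-buffering and parallel column decoding enabled.
arrow::Status ReadWholeFile(
    const std::shared_ptr<arrow::fs::FileSystem>& some_filesystem,
    const std::string& filename,
    std::shared_ptr<arrow::Table>* out) {
  // 1. Open the file via the (custom) filesystem implementation.
  ARROW_ASSIGN_OR_RAISE(auto fp, some_filesystem->OpenInputFile(filename));

  // 2. Wrap it in the low-level Parquet file reader.
  std::unique_ptr<parquet::ParquetFileReader> pqr =
      parquet::ParquetFileReader::Open(fp);

  // 3-5. Arrow-level reader properties; pre-buffering coalesces the
  // column-chunk reads into fewer, larger reads.
  parquet::ArrowReaderProperties props;
  props.set_pre_buffer(true);

  // 6. Build the Arrow-facing FileReader.
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::FileReader::Make(
      arrow::default_memory_pool(), std::move(pqr), props, &reader));

  // 7. Parallelize column decoding across the CPU thread pool.
  reader->set_use_threads(true);

  // Any of the Read* methods will do; ReadTable pulls in the whole file.
  return reader->ReadTable(out);
}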
