Thank you, Weston! I managed to significantly improve performance by
following the pointers you provided. I did not call ::PreBuffer directly;
instead, since I am using the Arrow reader, I simply set the set_pre_buffer
flag.
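
For anyone finding this thread later, here is a rough sketch of the idea behind combining small reads into larger ones. It is not Arrow's implementation and uses no Arrow APIs: ReadRange, CoalesceRanges, TraceRanges and the hole_size_limit parameter are hypothetical names made up for this illustration. The ranges are the six ReadAt calls from the gdb trace quoted below.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// A byte range in a remote object: [offset, offset + length).
// Hypothetical type for this sketch, not an Arrow class.
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Merge ranges that overlap, touch, or are separated by at most
// `hole_size_limit` bytes, so that each merged range can be fetched
// with a single request. `hole_size_limit` is an assumed tuning knob.
std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                      int64_t hole_size_limit) {
  assert(hole_size_limit >= 0);
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (!merged.empty()) {
      ReadRange& last = merged.back();
      const int64_t last_end = last.offset + last.length;
      if (r.offset <= last_end + hole_size_limit) {
        // Close enough to the previous range: extend it instead of
        // issuing a separate request.
        const int64_t new_end = std::max(last_end, r.offset + r.length);
        last.length = new_end - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}

// The (position, nbytes) pairs of the six ReadAt calls in the gdb trace.
std::vector<ReadRange> TraceRanges() {
  return {{84273, 7846},  {92119, 6974},   {99093, 7040},
          {106133, 6875}, {113008, 29380}, {142388, 26536}};
}
```

The six reads in the trace happen to be back to back, so CoalesceRanges(TraceRanges(), 0) collapses them into one range of 84651 bytes starting at offset 84273, which would be one request instead of six.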


On Thu, Nov 4, 2021 at 11:41 PM Weston Pace <[email protected]> wrote:

> > Is there some configuration parameter I can tweak? Is this a known
> > issue? Has it been addressed in newer versions?
>
> Yes/Yes/Yes
>
> There has been some work done here in recent releases.  I think some
> of the biggest changes arrived in 4.0.0 but a few bug fixes have also
> been done since then.
>
> If you are reading single files then what you will want to look for is
> parquet::ParquetFileReader::PreBuffer.  This function must be called
> to give an indication of what data you plan to read.  Once you've done
> that the reader will combine small reads into larger reads which
> should reduce the total number of reads.  This should give a pretty
> significant boost to S3 performance.
>
> You may also want to look into the datasets API.  The datasets logic
> not only prebuffers for you but also reads multiple files (and
> multiple batches within a file) concurrently.
>
> On Thu, Nov 4, 2021 at 12:05 PM Bipin Mathew <[email protected]>
> wrote:
> >
> > Hello Apache Arrow Team,
> >
> > I am using apache-arrow-3.0.0 and encountering a significant performance
> > issue reading parquet files over s3. I believe I have traced down the issue
> > to a very large number of curl requests being made apparently on an
> > as-needed basis ( see gdb trace below ). That is, there does not appear to
> > be any obvious buffering going on to amortize the over-the-wire latency. Am
> > I doing something wrong here? Is there some configuration parameter I can
> > tweak? Is this a known issue? Has it been addressed in newer versions? Any
> > guidance will be greatly appreciated.
> >
> > Kind Regards,
> >
> > Bipin
> >
> > Trace of program using gdb stopped at "arrow::fs::(anonymous
> > namespace)::ObjectInputFile::ReadAt". Notice how small (nbytes) each request
> > is.
> >
> > Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
> >     position=84273, nbytes=7846)
> >     at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> > 740           ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> > (gdb) cont
> > Continuing.
> > [New Thread 0x7fffde68b700 (LWP 750147)]
> > [Thread 0x7fffde68b700 (LWP 750147) exited]
> >
> > Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
> >     position=92119, nbytes=6974)
> >     at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> > 740           ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> > (gdb) cont
> > Continuing.
> >
> > Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
> >     position=99093, nbytes=7040)
> >     at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> > 740           ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> > (gdb) cont
> > Continuing.
> > [New Thread 0x7fffde68b700 (LWP 750164)]
> > [Thread 0x7fffde68b700 (LWP 750164) exited]
> >
> > Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
> >     position=106133, nbytes=6875)
> >     at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> > 740           ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> > (gdb) cont
> > Continuing.
> >
> > Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
> >     position=113008, nbytes=29380)
> >     at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> > 740           ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> > (gdb) cont
> > Continuing.
> > [New Thread 0x7fffde68b700 (LWP 750181)]
> > [Thread 0x7fffde68b700 (LWP 750181) exited]
> >
> > Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
> >     position=142388, nbytes=26536)
> >     at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> > 740           ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
>
