Thank you, Weston! I managed to significantly improve performance by following the pointers you provided. I did not call parquet::ParquetFileReader::PreBuffer directly; since I am using the Arrow reader, I simply enabled pre-buffering through set_pre_buffer.

On Thu, Nov 4, 2021 at 11:41 PM Weston Pace <[email protected]> wrote:
> > Is there some configuration parameter I can tweak? Is this a known
> > issue? Has it been addressed in newer versions?
>
> Yes/Yes/Yes
>
> There has been some work done here in recent releases. I think some
> of the biggest changes arrived in 4.0.0 but a few bug fixes have also
> been done since then.
>
> If you are reading single files then what you will want to look for is
> parquet::ParquetFileReader::PreBuffer. This function must be called
> to give an indication of what data you plan to read. Once you've done
> that the reader will combine small reads into larger reads which
> should reduce the total number of reads. This should give a pretty
> significant boost to S3 performance.
>
> You may also want to look into the datasets API. The datasets logic
> not only prebuffers for you but also reads multiple files (and
> multiple batches within a file) concurrently.
>
> On Thu, Nov 4, 2021 at 12:05 PM Bipin Mathew <[email protected]> wrote:
> >
> > Hello Apache Arrow Team,
> >
> > I am using apache-arrow-3.0.0 and encountering a significant performance
> > issue reading parquet files over s3. I believe I have traced down the issue
> > to a very large number of curl requests being made apparently on an
> > as-needed basis ( see gdb trace below ). That is, there does not appear to
> > be any obvious buffering going on to amortize the over-the-wire latency. Am
> > I doing something wrong here? Is there some configuration parameter I can
> > tweak? Is this a known issue? Has it been addressed in newer versions? Any
> > guidance will be greatly appreciated.
> >
> > Kind Regards,
> >
> > Bipin
> >
> > Trace of program using gdb stopped at
> > "arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt"
> > Notice how small (nbytes) each request is.
> >
> > Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
> >     position=84273, nbytes=7846)
> >     at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> > 740       ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> > (gdb) cont
> > Continuing.
> > [New Thread 0x7fffde68b700 (LWP 750147)]
> > [Thread 0x7fffde68b700 (LWP 750147) exited]
> >
> > Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
> >     position=92119, nbytes=6974)
> >     at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> > 740       ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> > (gdb) cont
> > Continuing.
> >
> > Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
> >     position=99093, nbytes=7040)
> >     at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> > 740       ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> > (gdb) cont
> > Continuing.
> > [New Thread 0x7fffde68b700 (LWP 750164)]
> > [Thread 0x7fffde68b700 (LWP 750164) exited]
> >
> > Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
> >     position=106133, nbytes=6875)
> >     at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> > 740       ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> > (gdb) cont
> > Continuing.
> >
> > Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
> >     position=113008, nbytes=29380)
> >     at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> > 740       ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
> > (gdb) cont
> > Continuing.
> > [New Thread 0x7fffde68b700 (LWP 750181)]
> > [Thread 0x7fffde68b700 (LWP 750181) exited]
> >
> > Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960,
> >     position=142388, nbytes=26536)
> >     at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
> > 740       ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
