Hello Apache Arrow Team,
I am using apache-arrow-3.0.0 and encountering a significant performance
issue when reading Parquet files over S3. I believe I have traced the issue
to a very large number of curl requests, apparently made on an as-needed
basis (see the gdb trace below). That is, there does not appear to be any
buffering to amortize the over-the-wire latency. Am I doing something wrong
here? Is there some configuration parameter I can tweak? Is this a known
issue? Has it been addressed in newer versions? Any guidance will be greatly
appreciated.
Kind Regards,
Bipin
Trace of the program under gdb, stopped at "arrow::fs::(anonymous
namespace)::ObjectInputFile::ReadAt". Notice how small each request (nbytes)
is.
Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960, position=84273, nbytes=7846)
    at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
740 ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
(gdb) cont
Continuing.
[New Thread 0x7fffde68b700 (LWP 750147)]
[Thread 0x7fffde68b700 (LWP 750147) exited]
Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960, position=92119, nbytes=6974)
    at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
740 ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
(gdb) cont
Continuing.
Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960, position=99093, nbytes=7040)
    at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
740 ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
(gdb) cont
Continuing.
[New Thread 0x7fffde68b700 (LWP 750164)]
[Thread 0x7fffde68b700 (LWP 750164) exited]
Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960, position=106133, nbytes=6875)
    at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
740 ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
(gdb) cont
Continuing.
Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960, position=113008, nbytes=29380)
    at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
740 ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
(gdb) cont
Continuing.
[New Thread 0x7fffde68b700 (LWP 750181)]
[Thread 0x7fffde68b700 (LWP 750181) exited]
Thread 1 "e.bin" hit Breakpoint 1, arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt (this=0x7fc960, position=142388, nbytes=26536)
    at /home/bmathew/kparquet/l64/build/arrow/cpp/src/arrow/filesystem/s3fs.cc:740
740 ARROW_ASSIGN_OR_RAISE(int64_t bytes_read,
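
To illustrate the kind of buffering I would have expected: in the trace
above, each read starts exactly where the previous one ended (for example,
84273 + 7846 = 92119), so the six requests cover one contiguous range. The
sketch below merges such ranges into a single larger read. It uses only the
C++ standard library; ReadRange, CoalesceRanges, and TraceRanges are
hypothetical names made up for this illustration, not Arrow APIs, and this is
not a claim about how Arrow does or should implement it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// A byte range as seen in the trace: ReadAt(position, nbytes).
struct ReadRange {
  int64_t position;
  int64_t nbytes;
};

// Merge ranges that are separated by at most max_gap bytes into single,
// larger reads. Hypothetical helper for illustration; not part of Arrow.
std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                      int64_t max_gap) {
  assert(max_gap >= 0);
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.position < b.position;
            });
  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (!merged.empty()) {
      ReadRange& last = merged.back();
      const int64_t last_end = last.position + last.nbytes;
      if (r.position - last_end <= max_gap) {
        // Extend the previous read to cover this one (and any small gap).
        last.nbytes = std::max(last_end, r.position + r.nbytes) - last.position;
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}

// The six (position, nbytes) pairs copied from the gdb trace.
std::vector<ReadRange> TraceRanges() {
  return {{84273, 7846},  {92119, 6974},   {99093, 7040},
          {106133, 6875}, {113008, 29380}, {142388, 26536}};
}
```

Calling CoalesceRanges(TraceRanges(), 0) yields a single range starting at
position 84273 with nbytes 84651, i.e. one request instead of six for the
portion of the file shown in the trace.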