GitHub user felipecrv added a comment to the discussion: problem parallelizing 
reading of multiple parquet files using S3FileSystem

How far are you from saturating your available network bandwidth?

You should try disabling "use threads" in the scanner and the parquet reader, since 
you're managing the scheduling yourself now. Those options mean extra thread creation 
and scheduler overhead when you already have `tbb::parallel_for` scheduling the 
tasks at a more coarse-grained level. It might not improve performance, but it's 
worth trying. A minimal sketch of what that could look like follows (Arrow C++; 
exact signatures vary slightly across Arrow versions, and `ReadAll` plus the 
surrounding structure are made-up stand-ins for your code):
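
```cpp
#include <memory>
#include <string>
#include <vector>

#include <arrow/filesystem/s3fs.h>
#include <arrow/io/api.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Read one parquet file per TBB iteration, with Arrow's internal
// threading turned off so tbb::parallel_for is the only scheduler.
std::vector<std::shared_ptr<arrow::Table>> ReadAll(
    const std::shared_ptr<arrow::fs::S3FileSystem>& fs,
    const std::vector<std::string>& paths) {
  std::vector<std::shared_ptr<arrow::Table>> tables(paths.size());
  tbb::parallel_for(
      tbb::blocked_range<size_t>(0, paths.size()),
      [&](const tbb::blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i != r.end(); ++i) {
          std::shared_ptr<arrow::io::RandomAccessFile> input =
              fs->OpenInputFile(paths[i]).ValueOrDie();
          std::unique_ptr<parquet::arrow::FileReader> reader;
          PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
              input, arrow::default_memory_pool(), &reader));
          // The key knob: no column-level parallelism inside the reader,
          // since the parallel_for already parallelizes across files.
          reader->set_use_threads(false);
          PARQUET_THROW_NOT_OK(reader->ReadTable(&tables[i]));
        }
      });
  return tables;
}
```

If you're going through the datasets API instead, the equivalent knob on the 
scanner side is `arrow::dataset::ScannerBuilder::UseThreads(false)`.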

Another thing to experiment with is increasing the block size (in number of 
files) of the parallel-for. Spawning too many tasks can create too much 
competition for I/O and CPU, which may end up making things worse. For example, 
you can pass an explicit grain size together with a `tbb::simple_partitioner` so 
each task handles a batch of files (the value 8 below is an arbitrary starting 
point to tune, not a recommendation):
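
```cpp
// Coarser granularity: each TBB task processes up to kFilesPerTask
// files, cutting scheduling overhead and I/O contention.
const size_t kFilesPerTask = 8;  // arbitrary; tune against a profiler
tbb::parallel_for(
    tbb::blocked_range<size_t>(0, paths.size(), kFilesPerTask),
    [&](const tbb::blocked_range<size_t>& r) {
      for (size_t i = r.begin(); i != r.end(); ++i) {
        // ... open and read paths[i] as in the sketch above ...
      }
    },
    tbb::simple_partitioner{});  // split only down to the grain size
```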

A profiler can show where the bottlenecks are.

GitHub link: 
https://github.com/apache/arrow/discussions/47160#discussioncomment-13863950
