GitHub user felipecrv added a comment to the discussion: problem parallelizing reading of multiple parquet files using S3FileSystem
How far are you from saturating your available network bandwidth? You should try disabling "use threads" in the scanner and the Parquet reader, since you're managing the scheduling yourself now. Those options mean more thread creation and scheduler overhead when you already have `tbb::parallel_for` scheduling the tasks at a coarser grain. It might not improve performance, but it's worth trying.

Another thing to experiment with is increasing the block size (in number of files) of the parallel-for. Spawning too many tasks can create too much competition for I/O and CPU, which may end up making things worse. A profiler can show where the bottlenecks are.

GitHub link: https://github.com/apache/arrow/discussions/47160#discussioncomment-13863950
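As a concrete illustration of the setup described above, here is a minimal sketch that reads each file directly with `parquet::arrow::FileReader` (with the Datasets scanner, the analogous knob is `ScannerBuilder::UseThreads(false)`). The bucket, object keys, and grain size are placeholders to adapt to your code; error handling is kept minimal on purpose:

```cpp
#include <cstddef>
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

#include <arrow/filesystem/s3fs.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>
#include <parquet/properties.h>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Read one Parquet file into a Table with the reader's own threading off;
// the outer tbb::parallel_for is the only source of parallelism.
std::shared_ptr<arrow::Table> ReadOneFile(
    const std::shared_ptr<arrow::fs::S3FileSystem>& fs, const std::string& path) {
  auto input = fs->OpenInputFile(path).ValueOrDie();

  parquet::ArrowReaderProperties props;
  props.set_use_threads(false);  // the "use threads" knob discussed above

  parquet::arrow::FileReaderBuilder builder;
  PARQUET_THROW_NOT_OK(builder.Open(input));
  builder.properties(props);
  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(builder.Build(&reader));

  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
  return table;
}

int main() {
  arrow::fs::S3GlobalOptions global_options;
  global_options.log_level = arrow::fs::S3LogLevel::Fatal;
  if (!arrow::fs::InitializeS3(global_options).ok()) return 1;

  // Placeholder options and paths: set region/credentials and real object keys.
  auto fs =
      arrow::fs::S3FileSystem::Make(arrow::fs::S3Options::Defaults()).ValueOrDie();
  std::vector<std::string> paths = {"my-bucket/data/part-0.parquet",
                                    "my-bucket/data/part-1.parquet"};

  std::vector<std::shared_ptr<arrow::Table>> tables(paths.size());

  // A grain size > 1 batches several files per task, so fewer concurrent
  // reads compete for network and CPU at once; tune it with a profiler.
  constexpr std::size_t kGrainSize = 4;
  tbb::parallel_for(
      tbb::blocked_range<std::size_t>(0, paths.size(), kGrainSize),
      [&](const tbb::blocked_range<std::size_t>& range) {
        for (std::size_t i = range.begin(); i != range.end(); ++i) {
          tables[i] = ReadOneFile(fs, paths[i]);
        }
      });

  std::printf("read %zu tables\n", tables.size());
  return arrow::fs::FinalizeS3().ok() ? 0 : 1;
}
```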