pitrou opened a new pull request #7098: URL: https://github.com/apache/arrow/pull/7098
The AWS SDK creates an auto-growing StringStream by default, which entails multiple memory copies when transferring large data blocks (each resize copies the data accumulated so far). Instead, write directly into the target data area.

Low-level benchmarks with a local Minio server:

* before:
```
-----------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
MinioFixture/ReadAll500Mib/real_time         434528630 ns    431461370 ns            2 bytes_per_second=1.1237G/s items_per_second=2.30134/s
MinioFixture/ReadChunked500Mib/real_time     419380389 ns    339293384 ns            2 bytes_per_second=1.16429G/s items_per_second=2.38447/s
MinioFixture/ReadCoalesced500Mib/real_time   258812283 ns       470149 ns            3 bytes_per_second=1.88662G/s items_per_second=3.8638/s
```

* after:
```
MinioFixture/ReadAll500Mib/real_time         194620947 ns    161227337 ns            4 bytes_per_second=2.50888G/s items_per_second=5.13819/s
MinioFixture/ReadChunked500Mib/real_time     276437393 ns    183030215 ns            3 bytes_per_second=1.76634G/s items_per_second=3.61746/s
MinioFixture/ReadCoalesced500Mib/real_time    86693750 ns       448568 ns            6 bytes_per_second=5.63225G/s items_per_second=11.5349/s
```

Parquet read benchmarks from a local Minio server show speedups from 1.1x to 1.9x.
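To illustrate the idea of writing directly into the target data area, here is a minimal sketch that uses only the C++ standard library. It is not the code from this PR and does not use the AWS SDK: `FixedBufferStreamBuf`, `ProduceBody` and `ReadIntoPreallocated` are hypothetical names, and `ProduceBody` merely stands in for a library that delivers a response body by writing to an `std::ostream`. The assumption is that the destination size is known up front, so the stream can wrap the destination buffer instead of growing its own.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical helper (not an AWS SDK or Arrow class): a streambuf whose
// put area is a fixed, caller-owned memory region. It never reallocates;
// once the region is full, the default overflow() reports failure.
class FixedBufferStreamBuf : public std::streambuf {
 public:
  FixedBufferStreamBuf(char* data, std::size_t size) { setp(data, data + size); }

  // Number of bytes written into the region so far.
  std::size_t bytes_written() const {
    return static_cast<std::size_t>(pptr() - pbase());
  }
};

// Stand-in for a library that delivers a response body by writing it to
// an std::ostream in chunks.
void ProduceBody(std::ostream& out, std::size_t nbytes) {
  const std::string chunk(4096, 'x');
  while (nbytes > 0) {
    const std::size_t n = std::min(nbytes, chunk.size());
    out.write(chunk.data(), static_cast<std::streamsize>(n));
    nbytes -= n;
  }
}

// Receive `nbytes` straight into a destination that is sized up front,
// so no intermediate growing buffer and no final copy are needed.
std::vector<char> ReadIntoPreallocated(std::size_t nbytes) {
  std::vector<char> dest(nbytes);
  FixedBufferStreamBuf buf(dest.data(), dest.size());
  std::ostream out(&buf);
  ProduceBody(out, nbytes);
  assert(out.good() && buf.bytes_written() == nbytes);
  return dest;
}
```

With a `std::stringstream` in place of the fixed buffer, the same producer would trigger repeated reallocations as the internal string grows, plus one more copy to move the result into its final location. The trade-off in this sketch is that writing more bytes than the region holds puts the stream into a failed state instead of growing.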