GitHub user aavbsouza added a comment to the discussion: Is it possible to 
reduce peak memory usage when using datasets (for predicate pushdown) to 
read single Parquet files?

Hello. The suggestion by @adamreeve to reduce batch_readahead was effective 
in reducing the memory consumption, at the cost of a longer read time. What 
I found more unexpected is that the memory used (with a readahead of 16) was 
many times greater than the size of the uncompressed data: about 70GB, while 
the Parquet file is 6.8GB and the saved column 7.6GB uncompressed.
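
For context, here is a minimal sketch in Arrow C++ of the kind of scan being 
discussed. The file path and the filter on a column named "value" are 
hypothetical placeholders; BatchReadahead caps how many record batches the 
scanner decodes ahead of the consumer, which is what trades read speed for 
peak memory:

    #include <arrow/api.h>
    #include <arrow/compute/expression.h>
    #include <arrow/dataset/api.h>
    #include <arrow/filesystem/api.h>

    namespace cp = arrow::compute;
    namespace ds = arrow::dataset;

    arrow::Result<std::shared_ptr<arrow::Table>> ReadFiltered(
        const std::string& path) {
      // Wrap the single Parquet file in a dataset so the scanner can
      // push the filter down into the Parquet reader.
      auto filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
      ARROW_ASSIGN_OR_RAISE(
          auto factory,
          ds::FileSystemDatasetFactory::Make(
              filesystem, {path}, std::make_shared<ds::ParquetFileFormat>(),
              ds::FileSystemFactoryOptions{}));
      ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

      ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
      // Predicate pushdown: row groups not matching the filter are skipped.
      ARROW_RETURN_NOT_OK(builder->Filter(
          cp::greater(cp::field_ref("value"), cp::literal(0.0))));
      // Read fewer batches ahead of the consumer: lower peak memory,
      // slower scan.
      ARROW_RETURN_NOT_OK(builder->BatchReadahead(4));
      ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
      return scanner->ToTable();
    }

Lowering the readahead shrinks the window of decoded batches held in flight 
at any moment, which is why the scan gets slower as memory drops.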
@pitrou I built the Arrow library using vcpkg with the jemalloc feature. 
Setting the ARROW_DEFAULT_MEMORY_POOL environment variable to system reduced 
the max RSS (measured with time -v) to about half of what the jemalloc pool 
used.
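
The same allocator switch can also be made explicitly in code rather than 
through the environment variable; a sketch, continuing the hypothetical 
builder above:

    #include <arrow/api.h>
    #include <arrow/dataset/api.h>

    namespace ds = arrow::dataset;

    // Point the scanner at the plain malloc-backed system pool instead
    // of the default jemalloc pool.
    arrow::Status UseSystemPool(ds::ScannerBuilder& builder) {
      return builder.Pool(arrow::system_memory_pool());
    }

A gap between jemalloc RSS and live allocations often comes from jemalloc 
retaining freed pages in its arenas; arrow::jemalloc_set_decay_ms(0) asks it 
to release dirty pages back to the OS sooner, which may narrow the 
difference without switching pools.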

GitHub link: 
https://github.com/apache/arrow/discussions/47003#discussioncomment-13739945
