GitHub user alexeyroytman created a discussion: [C++][Parquet] High memory usage on large Parquet file reading with Project and Filter, with Scanner::Scan()
Hello. I'm working on a proof of concept that uses Arrow C++ to read a large Parquet file with a partial projection and a filter. (I've done the same earlier in Java, with the Apache Parquet project.) I followed an example that uses `ScannerBuilder::Project()` and `ScannerBuilder::Filter()`. In my case, the process uses over 300% CPU, grows beyond 30 GB of virtual memory, starts thrashing, and at some point is killed by the OOM killer.

Some background on the file contents:

1. For now, I create the Parquet file myself and fully control it.
2. Its schema consists of D0..D9 (10 non-nullable strings stored as BINARY) and M0..M4 (5 nullable DOUBLEs).
3. Each D(i) is randomly selected from a relatively small set of strings (up to thousands).
4. M0 has a random value; the other M(j)s are null.
5. For D5 and D6, I enable dictionary encoding, statistics, and a Bloom filter.
6. For the M(j)s, I disable dictionary encoding, statistics, and Bloom filters.
7. The rows are not sorted.
8. The Parquet file is uncompressed.
9. The Parquet file contains 1 billion rows; its size is 9.2 GiB. (Creating the file takes ~50 minutes.)
10. The original comma-separated file in tabular form is 96.5 GiB (8.1 GiB when compressed with `gzip -9`).

The reading context:

1. I project all D(i) columns plus M0 and M1.
2. I filter on: `(D5==s51 || D5==s52) && (D6==s61 || D6==s62)`.
3. The expected number of matching rows is ~77 million (out of 1 billion).
4. I use `ScannerBuilder::Project()`, `ScannerBuilder::Filter()`, and then `Scanner::Scan()`.
5. Inside `Scanner::Scan()`, some batches have 0 rows; I'm not sure whether this is relevant.

I traced the large memory allocations (500 MiB and above); all of them go through `PoolBuffer::Resize()` and/or `PoolBuffer::Reserve()`. The stack traces show:

1. `parquet::TypedDecoder<parquet::PhysicalType<(parquet::Type::type)6> >::DecodeArrowNonNull(int, parquet::EncodingTraits<parquet::PhysicalType<(parquet::Type::type)6> >::Accumulator*)`
2. `TransferColumnData(parquet::internal::RecordReader*, std::unique_ptr<parquet::ColumnChunkMetaData, std::default_delete<parquet::ColumnChunkMetaData> >, std::shared_ptr<arrow::Field> const&, parquet::ColumnDescriptor const*, parquet::arrow::ReaderContext const*, std::shared_ptr<arrow::ChunkedArray>*)`

GitHub link: https://github.com/apache/arrow/discussions/49976
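For context, here is a minimal sketch of how such a scan might be set up with the Arrow C++ Dataset API, including the readahead and batch-size knobs that bound how much decoded data is held in flight at once (these are the usual levers when a scan over a large file consumes too much memory). The file path and the literal values `s51`/`s52`/`s61`/`s62` are placeholders, not taken from the original post; this is a sketch under those assumptions, not the poster's actual code.

```cpp
// Sketch: scan a single local Parquet file with projection + filter,
// capping in-flight memory via BatchSize/BatchReadahead/FragmentReadahead.
// Requires Arrow C++ built with the dataset and filesystem modules.
#include <arrow/api.h>
#include <arrow/compute/expression.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>
#include <iostream>

namespace ds = arrow::dataset;
namespace cp = arrow::compute;

arrow::Status ScanFile(const std::string& path) {  // path is a placeholder
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::FileSystemFactoryOptions options;
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      ds::FileSystemDatasetFactory::Make(fs, {path}, format, options));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(builder->Project(
      {"D0", "D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8", "D9",
       "M0", "M1"}));
  // (D5==s51 || D5==s52) && (D6==s61 || D6==s62); literals are placeholders.
  ARROW_RETURN_NOT_OK(builder->Filter(cp::and_(
      cp::or_(cp::equal(cp::field_ref("D5"), cp::literal("s51")),
              cp::equal(cp::field_ref("D5"), cp::literal("s52"))),
      cp::or_(cp::equal(cp::field_ref("D6"), cp::literal("s61")),
              cp::equal(cp::field_ref("D6"), cp::literal("s62"))))));
  // Bound how much decoded data is buffered at once.
  ARROW_RETURN_NOT_OK(builder->BatchSize(64 * 1024));  // rows per batch
  ARROW_RETURN_NOT_OK(builder->BatchReadahead(4));     // batches buffered
  ARROW_RETURN_NOT_OK(builder->FragmentReadahead(1));  // files in flight
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());

  int64_t rows = 0;
  ARROW_ASSIGN_OR_RAISE(auto batches, scanner->ScanBatches());
  for (arrow::Result<ds::TaggedRecordBatch> maybe_batch : batches) {
    ARROW_ASSIGN_OR_RAISE(auto tagged, maybe_batch);
    rows += tagged.record_batch->num_rows();  // 0-row batches are normal
  }
  std::cout << "matched rows: " << rows << std::endl;
  return arrow::Status::OK();
}
```

`ScanBatches()` pulls batches through a bounded pipeline, so lowering `BatchSize` and the readahead values trades throughput for a smaller peak footprint; the older `Scanner::Scan()` API mentioned in the post predates some of these knobs.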
