GitHub user gistrec added a comment to the discussion: [C++][Parquet] High
memory usage on large Parquet file reading with Project and Filter, with
Scanner::Scan()
This looks less like a leak and more like scan concurrency / buffering.
`Scanner` is designed to stream record batches, but with threads and readahead
enabled it can still have multiple batches/fragments in flight at the same
time. For wide/string-heavy Parquet columns, the decoded in-memory
representation can be much larger than the compressed, encoded file on disk.
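To make the "multiple batches in flight" point concrete, here is a rough back-of-envelope sketch. It is only a simplified model, not how Arrow accounts for memory: `EstimateInFlightBytes` and its parameters are hypothetical names, and the average decoded row size is something you would have to measure on a sample of your own data.

```cpp
#include <cstdint>

// Rough upper-bound model (an assumption, not Arrow's real accounting):
// every in-flight batch is treated as fully decoded and resident at once.
//   rows_per_batch    - e.g. the value passed to BatchSize()
//   avg_row_bytes     - decoded bytes per row, measured on a sample
//   batches_in_flight - batches buffered across threads and readahead
inline std::uint64_t EstimateInFlightBytes(std::uint64_t rows_per_batch,
                                           std::uint64_t avg_row_bytes,
                                           std::uint64_t batches_in_flight) {
  return rows_per_batch * avg_row_bytes * batches_in_flight;
}
```

Under this model, 65536 rows at 200 decoded bytes per row is about 12.5 MiB per batch, and that figure scales linearly with however many batches the scanner keeps buffered.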
A few things worth trying:
1. Disable threading as a diagnostic check:
```cpp
ARROW_RETURN_NOT_OK(builder.UseThreads(false));
```
If RSS drops significantly, then the issue is likely parallel scan buffering
rather than the filter itself.
2. Iterate batches instead of materializing a large result:
```cpp
ARROW_ASSIGN_OR_RAISE(auto scanner, builder.Finish());
ARROW_ASSIGN_OR_RAISE(auto it, scanner->ScanBatches());
for (;;) {
  // Next() yields a TaggedRecordBatch by value, not a pointer
  ARROW_ASSIGN_OR_RAISE(auto tagged, it.Next());
  // a null record_batch marks the end of iteration
  if (!tagged.record_batch) break;
  const auto& batch = tagged.record_batch;
  // process batch here; don't accumulate all batches unless needed
}
```
3. Set a smaller batch size:
```cpp
ARROW_RETURN_NOT_OK(builder.BatchSize(65536));
```
4. If memory is still high, reduce readahead:
```cpp
ARROW_RETURN_NOT_OK(builder.FragmentReadahead(1));
// or 0 for the most conservative diagnostic run
```
None of these settings should prevent normal Parquet predicate/statistics
pushdown. The tradeoff is that lower threading/readahead may reduce peak
memory at the cost of throughput.
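To compare the diagnostic runs (threads on or off, different batch sizes and readahead values), it helps to record peak RSS the same way each time. A minimal sketch, assuming a POSIX system; `PeakRssRaw` is a hypothetical helper name and is not part of Arrow:

```cpp
#include <sys/resource.h>

#include <cstdint>

// Peak resident set size of the current process as reported by getrusage().
// Units are platform-specific: kilobytes on Linux, bytes on macOS.
// Returns -1 if the call fails.
inline std::int64_t PeakRssRaw() {
  struct rusage usage {};
  if (getrusage(RUSAGE_SELF, &usage) != 0) return -1;
  return static_cast<std::int64_t>(usage.ru_maxrss);
}
```

Calling this once after the scan loop finishes, in each configuration, gives numbers that are comparable across runs on the same machine.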
GitHub link:
https://github.com/apache/arrow/discussions/49976#discussioncomment-17009712