GitHub user gistrec added a comment to the discussion: [C++][Parquet] High 
memory usage on large Parquet file reading with Project and Filter, with 
Scanner::Scan()

This looks less like a leak and more like scan concurrency and buffering.

`Scanner` is designed to stream record batches, but with threads and readahead 
enabled it can still have multiple batches/fragments in flight at the same 
time. For wide or string-heavy Parquet columns, the decoded in-memory 
representation can be much larger than the encoded, compressed data on disk.
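
To get a feel for how quickly in-flight batches add up, here is a rough back-of-the-envelope sketch. It is not Arrow code: `ScanShape` and `EstimateInFlightBytes` are hypothetical names, and the model (rows per batch times decoded bytes per row times batches in flight) is an assumption that ignores allocator overhead, dictionary pages, and the effect of the filter.

```cpp
#include <cstdint>

// Hypothetical inputs for a crude estimate; none of these names come from Arrow.
struct ScanShape {
  std::int64_t rows_per_batch;         // e.g. the value passed to BatchSize()
  std::int64_t decoded_bytes_per_row;  // average decoded width of the projected columns
  std::int64_t batches_in_flight;      // batches buffered or being decoded at once
};

// Rough upper bound on decoded bytes held at the same time under this model.
constexpr std::int64_t EstimateInFlightBytes(const ScanShape& s) {
  return s.rows_per_batch * s.decoded_bytes_per_row * s.batches_in_flight;
}
```

With illustrative numbers (not measured ones), 65536 rows at 2000 decoded bytes per row is about 125 MiB per batch, so 8 batches in flight is already about 1000 MiB, while a single batch in flight stays at 125 MiB.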

A few things worth trying:

1. Disable threading as a diagnostic check:

```cpp
ARROW_RETURN_NOT_OK(builder.UseThreads(false));
```

If RSS drops significantly, then the issue is likely parallel scan buffering 
rather than the filter itself.

2. Iterate batches instead of materializing a large result:

```cpp
ARROW_ASSIGN_OR_RAISE(auto scanner, builder.Finish());
ARROW_ASSIGN_OR_RAISE(auto it, scanner->ScanBatches());

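for (;;) {
  // Next() yields a TaggedRecordBatch struct (record_batch + fragment)
  ARROW_ASSIGN_OR_RAISE(auto tagged, it.Next());
  if (!tagged.record_batch) break;  // a null record_batch marks end of stream

  const auto& batch = tagged.record_batch;
  // process batch here; don't accumulate all batches unless needed
}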
```

3. Set a smaller batch size:

```cpp
ARROW_RETURN_NOT_OK(builder.BatchSize(65536));
```

4. If memory is still high, reduce readahead:

```cpp
ARROW_RETURN_NOT_OK(builder.FragmentReadahead(1));
// or 0 for the most conservative diagnostic run, if your Arrow version
// accepts it (a rejected value comes back as a non-OK Status)
```

None of these settings should prevent normal Parquet predicate/statistics 
pushdown. The tradeoff is that less threading and readahead can lower peak 
memory at the cost of throughput.
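
To compare the diagnostic runs above, a small helper like this can report peak RSS at the end of each run. It is only a sketch and assumes Linux or macOS, since `getrusage` is POSIX and not available on Windows. `PeakRssBytes` and `BytesToMiB` are hypothetical names, not Arrow APIs.

```cpp
#include <sys/resource.h>  // getrusage (POSIX)

#include <cstdint>

// Returns this process's peak resident set size in bytes, or -1 on failure.
// ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
inline std::int64_t PeakRssBytes() {
  struct rusage usage;
  if (getrusage(RUSAGE_SELF, &usage) != 0) return -1;
#if defined(__APPLE__)
  return static_cast<std::int64_t>(usage.ru_maxrss);
#else
  return static_cast<std::int64_t>(usage.ru_maxrss) * 1024;
#endif
}

// Converts a byte count to MiB for easier side-by-side comparison.
inline double BytesToMiB(std::int64_t bytes) {
  return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

Peak RSS is a high-water mark for the whole process, so each configuration (threads on/off, different readahead) needs its own process run. Log `BytesToMiB(PeakRssBytes())` after the scan loop finishes and compare the numbers across runs.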


GitHub link: 
https://github.com/apache/arrow/discussions/49976#discussioncomment-17009712
