[jira] [Created] (ARROW-16294) [C++] Improve performance of parquet readahead

Weston Pace (Jira) Fri, 22 Apr 2022 15:59:05 -0700

Weston Pace created ARROW-16294:
-----------------------------------

             Summary: [C++] Improve performance of parquet readahead
                 Key: ARROW-16294
                 URL: https://issues.apache.org/jira/browse/ARROW-16294
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Weston Pace



The 7.0.0 readahead for parquet would read up to 256 row groups at once, which 
meant that, if the consumer was too slow, we would almost certainly run out of 
memory.

ARROW-15410 improved readahead as a whole and, in the process, changed parquet 
so that it always reads 1 row group in advance.

This is not always ideal in S3 scenarios.  We may want to read many row groups 
in advance if the row groups are small.  To fix this, we should continue 
issuing row group reads in parallel until at least batch_size * batch_readahead 
rows are being fetched.
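
As a rough illustration of the proposed rule (not the actual Arrow 
implementation), the sketch below counts how many additional row groups would 
be requested before the number of rows in flight reaches 
batch_size * batch_readahead.  The function RowGroupsToFetch and all of its 
parameters are hypothetical names chosen for this example; the real scanner 
would take row counts from the parquet file metadata and track in-flight reads 
asynchronously.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the Arrow API.
// row_group_rows: number of rows in each row group, in file order.
// next:           index of the first row group not yet requested.
// rows_in_flight: rows already being fetched but not yet consumed.
// Returns how many more row groups to request so that at least
// batch_size * batch_readahead rows are in flight (or the file ends).
std::size_t RowGroupsToFetch(const std::vector<int64_t>& row_group_rows,
                             std::size_t next,
                             int64_t rows_in_flight,
                             int64_t batch_size,
                             int64_t batch_readahead) {
  assert(batch_size > 0 && batch_readahead > 0);
  const int64_t target_rows = batch_size * batch_readahead;
  std::size_t count = 0;
  // Keep adding row groups while we are below the target and rows remain.
  while (next + count < row_group_rows.size() && rows_in_flight < target_rows) {
    rows_in_flight += row_group_rows[next + count];
    ++count;
  }
  return count;
}
```

With row groups of 1000 rows each, batch_size = 1024 and batch_readahead = 2, 
this sketch would request three row groups (3000 rows, the first total that 
reaches the 2048-row target), whereas a single large row group of one million 
rows would satisfy the target on its own.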



