Setting the batch size will not have much of an impact on the amount of memory used. Memory use is mostly controlled by I/O readahead (i.e. how many record batches to read ahead at once). The readahead settings are not currently exposed in pyarrow, although a PR was recently merged[1] that should make them available in 10.0.0.
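
For reference, once that change is released, limiting the readahead from Python should look roughly like the sketch below. This is untested and assumes the options end up exposed on the scanner as batch_readahead and fragment_readahead; "example.parquet" is just a placeholder path.

    import pyarrow.dataset as ds

    dataset = ds.dataset("example.parquet", format="parquet")
    # batch_readahead / fragment_readahead control how many record batches
    # and file fragments the scanner keeps in flight; lowering them should
    # reduce peak memory use at the cost of throughput.
    for batch in dataset.to_batches(batch_size=100000,
                                    batch_readahead=1,
                                    fragment_readahead=1):
        break  # only the first batch is needed for sampling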
OOM when reading a single 2GB parquet file seems kind of extreme. How much
RAM is available on the system? Do you know if the parquet file has some
very compressive encodings (e.g. dictionary encoding with long strings or
run-length encoding with long runs)?

> I am confused and want to know the underlying behavior in the C++ Arrow
> Parquet reader when setting batch_size to be small?

Basically the readahead tries to keep some number of rows in flight. If
the batches are small then it tries to run lots of batches at once. If the
batches are large then it will only run a few batches at once. So yes,
extremely small batches will incur a lot of overhead, both in terms of RAM
and compute.

> My end goal is to sample just a few rows (~5 rows) from any Parquet file, to
> estimate in-memory data size of the whole file, based on sampled rows.

I'm not sure 5 rows will be enough for this. However, one option might be
to just read in a single row group (assuming the file has multiple row
groups). There is a rough sketch of this at the bottom of this message.

One last idea might be to disable pre-buffering (also sketched at the
bottom). Pre-buffering is currently using too much RAM on file reads[2].
You could also try setting use_legacy_dataset to True. The legacy reader
isn't quite so aggressive with readahead and might use less RAM. However,
I still don't think you'll be able to do better than reading a single row
group.

[1] https://github.com/apache/arrow/pull/13799
[2] https://issues.apache.org/jira/browse/ARROW-17599

On Fri, Sep 2, 2022 at 8:32 AM Cheng Su <[email protected]> wrote:
>
> Hello,
>
> I am using PyArrow, and encountering an OOM issue when reading the Parquet
> file. My end goal is to sample just a few rows (~5 rows) from any Parquet
> file, to estimate in-memory data size of the whole file, based on sampled
> rows.
>
> We tried the following approaches:
> * `to_batches(batch_size=5)` -
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileSystemDataset.html#pyarrow.dataset.FileSystemDataset.to_batches
> * `head(num_rows=5, batch_size=5)` -
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.head
>
> But with both approaches, we encountered OOM issues when just reading 5 rows
> several times from a ~2GB Parquet file. Then we tried
> `to_batches(batch_size=100000)`, and it works fine without OOM issues.
>
> I am confused and want to know the underlying behavior in the C++ Arrow
> Parquet reader when setting batch_size to be small? I guess there might be
> some exponential overhead associated with batch_size when its value is
> small.
>
> Thanks,
> Cheng Su
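
To make the single-row-group idea concrete, here is a rough, untested
sketch ("example.parquet" is a placeholder path; it assumes the file's row
groups are roughly representative of the whole file):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")
    # Decode only the first row group instead of scanning the whole file.
    first_group = pf.read_row_group(0)

    # Scale the per-row in-memory size of the decoded row group up to the
    # file's total row count to estimate the size of the whole file.
    per_row_bytes = first_group.nbytes / first_group.num_rows
    estimated_total_bytes = per_row_bytes * pf.metadata.num_rows

This keeps at most one decoded row group in memory, which should stay well
below the size of the whole file as long as the file isn't written as a
single giant row group.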

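And a sketch of disabling pre-buffering on the dataset path (again
untested), using the parquet fragment scan options:

    import pyarrow.dataset as ds

    # Turn off pre-buffering for parquet fragments; see [2] for why
    # pre-buffering can currently use a lot of RAM.
    parquet_format = ds.ParquetFileFormat(
        default_fragment_scan_options=ds.ParquetFragmentScanOptions(
            pre_buffer=False))
    dataset = ds.dataset("example.parquet", format=parquet_format)
    table = dataset.head(5)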