GitHub user ikrommyd created a discussion: Iterating over parquet dataset in
batches
Hello,
I would like to iterate over a parquet dataset in batches.
I create my parquet dataset like this:
```py
import pyarrow.parquet as pq
dataset = pq.ParquetDataset(source_path, filters=cat_dict[cat]["cat_filter"])
```
However, the `pyarrow.parquet.ParquetDataset` class doesn't seem to have a
batched iteration method.
After briefly looking at its source code, I found that I can access the
underlying `pyarrow.dataset.Dataset` via the `_dataset` attribute, so I'm
iterating like this:
```py
for batch in dataset._dataset.to_batches(filter=dataset._filter_expression):
    ...
```
That feels a bit hacky to me because I'm depending on internals like
`dataset._dataset` and `dataset._filter_expression`. Is this the proper way to
do this? Is there a better API that I couldn't find that users should be using?
Thanks in advance!
GitHub link: https://github.com/apache/arrow/discussions/47988