GitHub user ikrommyd closed a discussion: Iterating over parquet dataset in batches

Hello,

I would like to iterate over a parquet dataset in batches.
I get my parquet dataset like this:
```py
import pyarrow.parquet as pq

dataset = pq.ParquetDataset(source_path, filters=cat_dict[cat]["cat_filter"])
```
However, the `pyarrow.parquet.ParquetDataset` class doesn't seem to have a
batched iteration method.
After briefly looking at its source code, I found that I can access the
underlying `pyarrow.dataset.Dataset` via the `_dataset` attribute, which
has a `to_batches` method.
Therefore I'm doing this to iterate:
```py
for batch in dataset._dataset.to_batches(filter=dataset._filter_expression):
    ...
```
That feels a bit hacky to me because I'm depending on internals like
`dataset._dataset` and `dataset._filter_expression`. Is this the proper way to
do this? Is there a better public API that I've missed?

Thanks in advance!

GitHub link: https://github.com/apache/arrow/discussions/47988
