GitHub user sidneymau added a comment to the discussion: Iterating over parquet
dataset in batches
I believe the `dataset` module is the preferred way to do this:
```
import pyarrow.dataset as ds

# Filters are applied at scan time via the `filter` argument of to_batches()
dataset = ds.dataset(source_path)
for batch in dataset.to_batches(filter=cat_dict[cat]["cat_filter"]):
    ...
```
See https://arrow.apache.org/docs/python/dataset.html for more documentation
GitHub link:
https://github.com/apache/arrow/discussions/47988#discussioncomment-14807877