GitHub user sidneymau added a comment to the discussion: Iterating over parquet
dataset in batches
I believe the `dataset` module is the preferred way to do this:
```
import pyarrow.dataset as ds

# Filters are applied at scan time via the `filter` argument of to_batches()
dataset = ds.dataset(source_path)
for batch in dataset.to_batches(filter=cat_dict[cat]["cat_filter"]):
    ...
```
See https://arrow.apache.org/docs/python/dataset.html for more documentation
GitHub link:
https://github.com/apache/arrow/discussions/47988#discussioncomment-14807877