Re: [D] Iterating over parquet dataset in batches [arrow]

via GitHub Tue, 28 Oct 2025 13:38:13 -0700


GitHub user ikrommyd edited a comment on the discussion: Iterating over parquet 
dataset in batches


> I believe the `dataset` module is the preferred way to do this:
> 
> ```python
> import pyarrow.dataset as ds
> 
> dataset = ds.dataset(source_path)
> 
> for batch in dataset.to_batches(filter=cat_dict[cat]["cat_filter"]):
>     ...
> ```
> 
> See https://arrow.apache.org/docs/python/dataset.html for some more docs

Yeah, I had seen that. The problem I ran into was this:
```py
In [11]: filters = [("pt", ">", -1.0)]

In [12]: dataset = pq.ParquetDataset("storage/NTuples/BBHto2G_M-125/nominal/", filters=filters)

In [13]: batch = next(dataset._dataset.to_batches(filter=dataset._filter_expression))

In [14]: type(batch)
Out[14]: pyarrow.lib.RecordBatch

In [15]: dataset = ds.dataset("storage/NTuples/BBHto2G_M-125/nominal/")

In [16]: type(dataset)
Out[16]: pyarrow._dataset.FileSystemDataset

In [17]: batch = next(dataset.to_batches(filter=filters))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[17], line 1
----> 1 batch = next(dataset.to_batches(filter=filters))

TypeError: Argument 'filter' has incorrect type (expected pyarrow._compute.Expression, got list)
```
`ParquetDataset` accepts a list for its `filters` argument, while `to_batches`
expects an `Expression` for its `filter` argument.
Is there a public API to compile such a list into an `Expression`? There is
definitely a private one, since `ParquetDataset` can do it.

GitHub link: https://github.com/apache/arrow/discussions/47988#discussioncomment-14808010

