GitHub user ikrommyd edited a comment on the discussion: Iterating over parquet dataset in batches
> I believe the `dataset` module is the preferred way to do this:
>
> ```python
> import pyarrow.dataset as ds
>
> dataset = ds.dataset(source_path)
>
> for batch in dataset.to_batches(filter=cat_dict[cat]["cat_filter"]):
>     ...
> ```
>
> See https://arrow.apache.org/docs/python/dataset.html for some more docs

Yeah, I had seen that. The problem I encountered was this:

```py
In [11]: filters = [("pt", ">", -1.0)]

In [12]: dataset = pq.ParquetDataset("storage/NTuples/BBHto2G_M-125/nominal/", filters=filters)

In [13]: batch = next(dataset._dataset.to_batches(filter=dataset._filter_expression))

In [14]: type(batch)
Out[14]: pyarrow.lib.RecordBatch

In [15]: dataset = ds.dataset("storage/NTuples/BBHto2G_M-125/nominal/")

In [16]: type(dataset)
Out[16]: pyarrow._dataset.FileSystemDataset

In [17]: batch = next(dataset.to_batches(filter=filters))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[17], line 1
----> 1 batch = next(dataset.to_batches(filter=filters))

TypeError: Argument 'filter' has incorrect type (expected pyarrow._compute.Expression, got list)
```

`ParquetDataset` accepts a list for its `filters` argument, while `to_batches` expects an `Expression` for its `filter` argument. Is there a public API to compile a list into an `Expression`? There is definitely a private one, since `ParquetDataset` can do it.

GitHub link: https://github.com/apache/arrow/discussions/47988#discussioncomment-14808010
