Hi Xinyu,
> The result parquet file can be read by Spark. But using ParquetDataset
> with use_legacy_dataset=False will result in a segmentation fault. Setting
> use_legacy_dataset=True works fine.
>
The new implementation does not support row_group_size.
Can you try using max_rows_per_group together with the new API?
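For example, a minimal sketch (the table contents and the output
directory name are made up for illustration):

import pyarrow as pa
import pyarrow.dataset as ds

# A small table just for illustration.
table = pa.table({"l_orderkey": list(range(10_000)),
                  "l_quantity": [1.0] * 10_000})

# The dataset writer controls row group sizes with
# min_rows_per_group/max_rows_per_group instead of row_group_size.
ds.write_dataset(
    table,
    "lineitem_1K",                # hypothetical output directory
    format="parquet",
    max_rows_per_group=1_000,     # cap each row group at 1K rows
    min_rows_per_group=1_000,     # avoid tiny trailing row groups
)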
> I also find that when use_legacy_dataset=True, it is not possible to
> pass filters to the API; the error is the following:
>
> Traceback (most recent call last):
>   File "scripts/filter_exp.py", line 26, in <module>
>     dataset = pq.ParquetDataset('lineitem_1K.parquet',
>                                 filesystem=None, use_legacy_dataset=True,
>   File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 1439, in __init__
>     self._filter(filters)
>   File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 1561, in _filter
>     accepts_filter = self._partitions.filter_accepts_partition
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
>
> I am using pyarrow 7.0.0 on Ubuntu 20.04.
>
It looks like some code might be missing from your example?
In any case, *filtering* is only supported in the new dataset API; see
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
(the note under the use_legacy_dataset parameter).
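With the new API, filters can be passed on any column, e.g. (a minimal
sketch; the column name and predicate are made up):

import pyarrow.parquet as pq

# Filters are pushed down when reading with the new dataset API.
dataset = pq.ParquetDataset(
    "lineitem_1K.parquet",
    use_legacy_dataset=False,
    filters=[("l_quantity", ">", 20)],
)
table = dataset.read()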
Alenka