Hi Xinyu,
> The result parquet file can be read by Spark. But using ParquetDataset
> with use_legacy_dataset=False will result in a segmentation fault. Setting
> use_legacy_dataset=True works fine.
>
The new implementation does not support row_group_size.
Can you try using max_rows_per_group together with the new API?
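For example, a minimal sketch (the table contents and the output
directory name are made up for illustration):

import pyarrow as pa
import pyarrow.dataset as ds

# A small table just for illustration.
table = pa.table({"l_orderkey": list(range(10_000)),
                  "l_quantity": [1.0] * 10_000})

# The dataset writer controls row group sizes with
# min_rows_per_group/max_rows_per_group instead of row_group_size.
ds.write_dataset(
    table,
    "lineitem_1K",                # hypothetical output directory
    format="parquet",
    max_rows_per_group=1_000,     # cap each row group at 1K rows
    min_rows_per_group=1_000,     # avoid tiny trailing row groups
)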
> I also find that when use_legacy_dataset=True, it is not possible to
> pass filters to the API; the error is the following:
>
> Traceback (most recent call last):
>   File "scripts/filter_exp.py", line 26, in <module>
>     dataset = pq.ParquetDataset('lineitem_1K.parquet',
>                                 filesystem=None, use_legacy_dataset=True,
>   File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 1439, in __init__
>     self._filter(filters)
>   File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 1561, in _filter
>     accepts_filter = self._partitions.filter_accepts_partition
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
>
> I am using pyarrow 7.0.0 on Ubuntu 20.04.
>
It looks like some code might be missing from your example?
In any case, *filtering* is only supported in the new dataset API; see
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
(the note under the use_legacy_dataset parameter).
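With the new API, filters can be passed on any column, e.g. (a minimal
sketch; the column name and predicate are made up):

import pyarrow.parquet as pq

# Filters are pushed down when reading with the new dataset API.
dataset = pq.ParquetDataset(
    "lineitem_1K.parquet",
    use_legacy_dataset=False,
    filters=[("l_quantity", ">", 20)],
)
table = dataset.read()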
Alenka