Hi Alenka,

I assume the new implementation is for reading? When writing a
Parquet file we can still change the row group size. The seg fault
comes from reading, where row group size is not passed as a
parameter at all.
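
To make it concrete, this is roughly what I mean (a minimal sketch;
the file name, column, and row group size are just placeholders):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({'x': list(range(1000))})

    # Row group size can still be chosen when writing ...
    pq.write_table(table, 'example.parquet', row_group_size=100)

    # ... but reading with the new implementation takes no row group
    # size argument; this read path is where the seg fault shows up.
    dataset = pq.ParquetDataset('example.parquet', use_legacy_dataset=False)
    result = dataset.read()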

For the filtering case, yes, filtering is only supported in the new
dataset API. However, both the dataset API and the read_table API
still let you pass filters while setting use_legacy_dataset=True,
which is what triggers the error above. There should be logic in the
code to handle that combination instead of raising this error, or it
should at least be noted in the docs.
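
For comparison, the equivalent call with the new API works for me
(again a sketch; the column name and threshold are placeholders):

    import pyarrow.parquet as pq

    # Filters are honoured by the new dataset-based implementation
    dataset = pq.ParquetDataset('lineitem_1K.parquet',
                                use_legacy_dataset=False,
                                filters=[('l_quantity', '>', 25)])
    table = dataset.read()

    # read_table accepts the same filters argument
    table = pq.read_table('lineitem_1K.parquet',
                          filters=[('l_quantity', '>', 25)],
                          use_legacy_dataset=False)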

On Thu, Apr 14, 2022 at 5:09 PM Alenka Frim <[email protected]> wrote:
>
> Hi Xinyu,
>
>> The result parquet file can be read by Spark. But using ParquetDataset
>> with use_legacy_dataset=False will result in segmentation fault. Set
>> use_legacy_dataset=True works fine.
>
>
> The new implementation does not support row_group_size.
> Can you try using max_rows_per_group together with the new API?
>
>> I also find that when use_legacy_dataset=True, it is not possible to
>> pass filters to the api, the error is following:
>>
>> Traceback (most recent call last):
>>   File "scripts/filter_exp.py", line 26, in <module>
>>     dataset = pq.ParquetDataset('lineitem_1K.parquet',
>> filesystem=None, use_legacy_dataset=True,
>>   File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py",
>> line 1439, in __init__
>>     self._filter(filters)
>>   File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py",
>> line 1561, in _filter
>>     accepts_filter = self._partitions.filter_accepts_partition
>> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
>>
>> I am using pyarrow 7.0.0 on Ubuntu 20.04.
>
>
> There might be some code missing in your example?
> In any case filtering is only supported in the new dataset API, see 
> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
>  (info under use_legacy_dataset parameter)
>
> Alenka
