Hi Alenka,

I assume the new implementation is for reading? When writing a
Parquet file we can still change the row group size. The seg fault
comes from reading, where row group size is not passed as a
parameter at all.
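
To make it concrete, this is roughly what I mean (a minimal sketch;
the file name, column, and row group size are just placeholders):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({'x': list(range(1000))})

    # Row group size can still be chosen when writing ...
    pq.write_table(table, 'example.parquet', row_group_size=100)

    # ... but reading with the new implementation takes no row group
    # size argument; this read path is where the seg fault shows up.
    dataset = pq.ParquetDataset('example.parquet', use_legacy_dataset=False)
    result = dataset.read()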

For the filtering case, yes, filtering is only supported in the new
dataset API. However, both the dataset API and the read_table API
still let you pass filters while setting use_legacy_dataset=True,
which is what triggers the error above. There should be logic in the
code to handle that combination instead of raising this error, or it
should at least be noted in the docs.
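
For comparison, the equivalent call with the new API works for me
(again a sketch; the column name and threshold are placeholders):

    import pyarrow.parquet as pq

    # Filters are honoured by the new dataset-based implementation
    dataset = pq.ParquetDataset('lineitem_1K.parquet',
                                use_legacy_dataset=False,
                                filters=[('l_quantity', '>', 25)])
    table = dataset.read()

    # read_table accepts the same filters argument
    table = pq.read_table('lineitem_1K.parquet',
                          filters=[('l_quantity', '>', 25)],
                          use_legacy_dataset=False)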

On Thu, Apr 14, 2022 at 5:09 PM Alenka Frim <[email protected]> wrote:
>
> Hi Xinyu,
>
>> The result parquet file can be read by Spark. But using ParquetDataset
>> with use_legacy_dataset=False will result in segmentation fault. Set
>> use_legacy_dataset=True works fine.
>
>
> The new implementation does not support row_group_size.
> Can you try using max_rows_per_group together with the new API?
>
>> I also find that when use_legacy_dataset=True, it is not possible to
>> pass filters to the api, the error is following:
>>
>> Traceback (most recent call last):
>>   File "scripts/filter_exp.py", line 26, in <module>
>>     dataset = pq.ParquetDataset('lineitem_1K.parquet',
>> filesystem=None, use_legacy_dataset=True,
>>   File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py",
>> line 1439, in __init__
>>     self._filter(filters)
>>   File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py",
>> line 1561, in _filter
>>     accepts_filter = self._partitions.filter_accepts_partition
>> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
>>
>> I am using pyarrow 7.0.0 on Ubuntu 20.04.
>
>
> There might be some code missing in your example?
> In any case filtering is only supported in the new dataset API, see 
> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
>  (info under use_legacy_dataset parameter)
>
> Alenka
