OK, I get it!
Indeed, it is better to do it separately (loading the file into a dataset and
then applying the filters).

Thank you for your answer :)


> On 2 Aug 2021, at 20:37, Weston Pace <[email protected]> wrote:
> 
> Hmm, it seems you managed to find a bit of an (I think) unintended use case 
> :).
> 
> The docs for pyarrow.parquet.read_table describe the "filters" parameter as:
> 
>> Each tuple has format: (key, op, value) and compares the key with the value.
>> The supported op are: = or ==, !=, <, >, <=, >=, in and not in. If the op is
>> in or not in, the value must be a collection such as a list, a set or a tuple.
>> 
>> Examples:
>> 
>> ('x', '=', 0)
>> ('y', 'in', ['a', 'b', 'c'])
>> ('z', 'not in', {'a','b'})
> 
> On the other hand, the filter you describe
> "~ds.field('my_field').is_valid()" is one of
> the new pyarrow.dataset expression-based filters.
> 
> pyarrow.parquet.read_table has been slowly migrating over to use the new
> dataset scanning (controlled by use_legacy_dataset).  It seems in 3.0.0 we
> must have taken whatever filters argument was given and passed it directly
> as a filter.  In 4.0.0 we try and take a list of the previously described
> tuples and convert them to dataset filters.
> 
> So the easiest fix is probably to just use the new datasets API directly:
> 
> TL;DR:
> 
>    import pyarrow.dataset as ds
>    my_dataset = ds.dataset('myparquetFile.parquet')
>    table = my_dataset.to_table(filter=~ds.field('data').is_valid())
> 
> On Mon, Aug 2, 2021 at 3:01 AM Fabrice Lefloch <[email protected]> wrote:
>> 
>> Hello,
>> 
>> Previously, with pyarrow 3.0.0, I filtered null columns in read_table this
>> way:
>> pq.read_table('myparquetFile.parquet', filters=~ds.field("my_field").is_valid())
>> It was working fine, but after upgrading to pyarrow 4.0.0 I now receive
>> an error:
>> "ValueError: An Expression cannot be evaluated to python True or False. If
>> you are using the 'and', 'or' or 'not' operators, use '&', '|' or '~'
>> instead."
>> I tried to use is_null() instead of is_valid() but with no luck either.
>> 
>> Is there some other way to apply this filter?
>> 
>> Thank you.
