Hmm, it seems you managed to find a bit of an (I think) unintended use case :).

The docs for pyarrow.parquet.read_table describe the "filters" parameter as:

> Each tuple has format: (key, op, value) and compares the key with the value.
> The supported op are: = or ==, !=, <, >, <=, >=, in and not in. If the op is
> in or not in, the value must be a collection such as a list, a set or a tuple.
>
> Examples:
>
> ('x', '=', 0)
> ('y', 'in', ['a', 'b', 'c'])
> ('z', 'not in', {'a','b'})
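
To make the tuple format a bit more concrete, here is a rough plain-Python
illustration of how such tuples could be evaluated against one row.  This is
not how pyarrow implements it; OPS and row_matches are made-up names, and I am
assuming a flat list of tuples means "all conditions must hold".

```python
import operator

# Hypothetical mapping from the documented op strings to Python comparisons.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "in": lambda cell, values: cell in values,
    "not in": lambda cell, values: cell not in values,
}


def row_matches(row, conditions):
    """Return True if the row satisfies every (key, op, value) tuple."""
    return all(OPS[op](row[key], value) for key, op, value in conditions)


# Made-up rows, just for illustration.
rows = [
    {"x": 0, "y": "a"},
    {"x": 1, "y": "b"},
    {"x": 0, "y": "d"},
]
conditions = [("x", "=", 0), ("y", "in", ["a", "b", "c"])]
kept = [row for row in rows if row_matches(row, conditions)]
```

Every op here compares a column against a given value, and none of the listed
ops is a null check.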

On the other hand, the filter you describe,
"~ds.field('my_field').is_valid()", is one of the new pyarrow.dataset
expression-based filters.

pyarrow.parquet.read_table has been slowly migrating over to use the new
dataset scanning (controlled by use_legacy_dataset).  It seems that in 3.0.0
we must have taken whatever filters argument was given and passed it directly
as a filter.  In 4.0.0 we try to take a list of the previously described
tuples and convert them to dataset filters, which is presumably where your
expression now fails.

TL;DR: the easiest fix is probably to just use the new datasets API directly:

    import pyarrow.dataset as ds

    my_dataset = ds.dataset('myparquetFile.parquet')
    table = my_dataset.to_table(filter=~ds.field('my_field').is_valid())

On Mon, Aug 2, 2021 at 3:01 AM Fabrice Lefloch <[email protected]> wrote:
>
> Hello,
>
> Previously when using pyarrow 3.0.0 when trying to filter null columns on
> read_table I was doing it this way:
> pq.read_table('myparquetFile.parquet', filters=~ds.field("my_field").is_valid())
> It was working fine, but when upgrading to pyarrow 4.0.0 I am now receiving
> an error
> "ValueError: An Expression cannot be evaluated to python True or False. If
> you are using the 'and', 'or' or 'not' operators, use '&', '|' or '~'
> instead."
> I tried to use is_null() instead of is_valid() but with no luck either.
>
> Is there some other way to apply this filter?
>
> Thank you.
