>
> If parquet stores statistics for each column of a struct array (don't know
> offhand if they do) then we should create a JIRA to expose this.


Parquet does store statistics for each leaf column, so every field of a
struct column gets its own statistics.

On Wed, Apr 20, 2022 at 3:34 PM Weston Pace <[email protected]> wrote:

> No and no.  This filter will not be used for predicate pushdown now or in
> 8.0.0.  It could possibly come after 8.0.0.  If parquet stores statistics
> for each column of a struct array (don't know offhand if they do) then we
> should create a JIRA to expose this.
>
> On Wed, Apr 20, 2022, 11:01 AM Partha Dutta <[email protected]>
> wrote:
>
>> That works! Thanks. Do you know off hand if this filter would be used in
>> a predicate pushdown for a parquet dataset? Or would it be possibly coming
>> in version 8.0.0?
>>
>> On Wed, Apr 20, 2022 at 3:49 PM Weston Pace <[email protected]>
>> wrote:
>>
>>> The second argument to `call_function` should be a list (the args to
>>> the function).  Since `arr3` is iterable it is interpreting it as a
>>> list of args and trying to treat each row as an argument to your call
>>> (this is the reason it thinks you have 3 arguments).  This should
>>> work:
>>>
>>>     pc.call_function("struct_field", [arr3], pc.StructFieldOptions(indices=[0]))
>>>
>>> Unfortunately, that evaluates the function immediately.  If you want
>>> to create an expression then you need some way to create a call and I
>>> don't actually know how to do that.  I can do something a little
>>> hackish:
>>>
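>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import pyarrow.dataset as ds
>>>
>>> # arr3 is the StructArray from the original question
>>> table = pa.Table.from_pydict({'values': arr3})
>>> dataset = ds.dataset(table)
>>> sf_call = ds.field('')._call('struct_field', [ds.field('values')], pc.StructFieldOptions(indices=[0]))
>>> dataset.to_table(filter=sf_call < 200)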
>>>
>>> However, I suspect there is probably a better way to create a call
>>> object than `ds.field('')._call(...)`
>>>
>>> On Wed, Apr 20, 2022 at 3:09 AM Partha Dutta <[email protected]>
>>> wrote:
>>> >
>>> > I'm trying to use the compute function struct_field in order to create
>>> > an expression for dataset filtering, but I'm running into an error. This
>>> > is the code snippet:
>>> >
>>> > arr1 = pa.array([100, 200, 300])
>>> > arr2 = pa.array([400, 500, 600])
>>> > arr3 = pa.StructArray.from_arrays([arr1, arr2], ["one", "two"])
>>> > e = pc.call_function("struct_field", arr3, pc.StructFieldOptions(indices=[0])) > 200
>>> > Traceback (most recent call last):
>>> >   File "<stdin>", line 1, in <module>
>>> >   File "pyarrow/_compute.pyx", line 531, in pyarrow._compute.call_function
>>> >   File "pyarrow/_compute.pyx", line 330, in pyarrow._compute.Function.call
>>> >   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>>> >   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
>>> > pyarrow.lib.ArrowInvalid: Function 'struct_field' accepts 1 arguments but attempted to look up kernel(s) with 3
>>> >
>>> > If I try to exclude the options, I get:
>>> > pyarrow.lib.ArrowInvalid: Function 'struct_field' cannot be called without options
>>> >
>>> > Any advice? I am using pyarrow 7.0.0
>>> > --
>>> > Partha Dutta
>>> > [email protected]
>>>
>>
>>
>> --
>> Partha Dutta
>> [email protected]
>>
>
