Thanks Joris and Lee. This is exactly what I was looking for. With the new
custom functions feature of pyarrow, it might be possible to do it in a
single pass, though the cost of jumping to Python might be prohibitive.

On Wed, Nov 2, 2022, 4:46 PM Joris Van den Bossche <
[email protected]> wrote:

> While there are indeed some workarounds possible by composing the
> existing kernels (as David shows), we should ideally have a direct
> kernel for this kind of operation, but that kernel currently doesn't
> exist.
>
> I recently ran into a similar issue, and I opened
> https://issues.apache.org/jira/browse/ARROW-18097 about a
> "list_contains" scalar kernel, which would already allow checking
> against a single value. Maybe we then also want a "list_is_in" kernel
> for checking with multiple values (although one could already combine
> multiple "list_contains" calls).
>
> Joris
>
> On Wed, 2 Nov 2022 at 20:01, Suresh V <[email protected]> wrote:
> >
> > Hi David, thank you very much for the response. I apologize for not
> > posing the question correctly.
> >
> > The method you have does give the right answer, but it results in
> > multiple new objects and multiple data passes.
> >
> > I was looking for a kernel which avoids that, as I am dealing with really
> > large arrays. Please let me know if I am not being clear.
> >
> > Thanks again for your help.
> >
> > On Wed, Nov 2, 2022, 2:40 PM Lee, David <[email protected]> wrote:
> >>
> >> Slight correction for 3 or 4 instead of just 3:
> >>
> >> result = pc.is_in(list(range(len(arr))),
> >>                   pc.filter(indices, pc.is_in(flat_arr, pa.array([3,4]))))
> >>
> >> From: Lee, David
> >> Sent: Wednesday, November 2, 2022 11:26 AM
> >> To: [email protected]
> >> Subject: RE: Filter a list array based on the contents of the list.
> >>
> >> This works:
> >>
> >> import pyarrow as pa
> >> import pyarrow.compute as pc
> >>
> >> arr = pa.array([[1,2],[3],[3,4,5]])
> >>
> >> indices = pc.list_parent_indices(arr)
> >> flat_arr = pc.list_flatten(arr)
> >>
> >> result = pc.is_in(list(range(len(arr))),
> >>                   pc.filter(indices, pc.equal(flat_arr, 3)))
> >>
> >> >>> result
> >> <pyarrow.lib.BooleanArray object at 0x00000243EA2D4D00>
> >> [
> >>   false,
> >>   true,
> >>   true
> >> ]
> >>
> >> From: Suresh V <[email protected]>
> >> Sent: Wednesday, November 2, 2022 10:23 AM
> >> To: [email protected]
> >> Subject: Filter a list array based on the contents of the list.
> >>
> >> Hi,
> >>
> >> Is there a compute function I can use to filter an array with list
> >> entries based on the contents of the list?
> >>
> >> For example:
> >>
> >> arr = pa.array([[1,2],[3],[3,4,5]])
> >>
> >> I want to run a compute function which returns true if the entries
> >> contain 3 or 4.
> >>
> >> Expected output is:
> >>
> >> pa.array([False, True, True])
> >>
> >> The closest I could find was map lookup, which expects the entries to be
> >> maps.
> >>
> >> Thanks
>
