>
> Is there a shortcut in Arrow to transform a list of indices to a
> corresponding boolean mask ([0, 2] --> [true, false, true, false, false])?
>

I just realized that one doesn't even need this transformation when using
the "take" compute function! So it all boils down to this short list of
compute functions, without the need for any "low level" code:

(1) Run the "equal" compute function on the values of the list:
[false, false, true, false, false, false, true, false, false, false, false]

(2) Run the "list_parent_indices" compute function:
[0, 0, 0, 1, 1, 2, 2, 2, 3, 4, 4]

(3) Use the result of (1) to "filter" the result of (2):
[0, 2]

(4) Run "unique" on the result of (3) in case the same value may appear
multiple times in a list:
[0, 2]

(5) Use the result of (4) to "take" from the list:
["a", "b", "x"]
["e", "x", "a"]

On Fri, Aug 13, 2021 at 11:24 PM Leonhard Gruenschloss <
[email protected]> wrote:

> Hi Niranda,
>
> Thanks a lot for the quick response! Yes, you're absolutely right, that
> should work! I somehow missed that the filter compute function takes _any_
> input type.
>
> So that only leaves (2)... Using (1) to filter the result of
> list_parent_indices gets pretty close:
> filter([0, 0, 0, 1, 1, 2, 2, 2, 3, 4, 4], [false, false, true, false,
> false, false, true, false, false, false, false]) = [0, 2]
>
> Is there a shortcut in Arrow to transform a list of indices to a
> corresponding boolean mask ([0, 2] --> [true, false, true, false, false])?
>
> Cheers,
> Leo
>
> On Fri, Aug 13, 2021 at 10:45 PM Niranda Perera <[email protected]>
> wrote:
>
>> Hi Leo,
>>
>> Can't you call compute.filter with the resultant bool array from step
>> (2)? 🤔
>>
>> On Fri, Aug 13, 2021, 06:17 Leonhard Gruenschloss <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'd like to filter a ListArray, based on whether a particular value is
>>> present in each list. Is there a better approach than the one described
>>> below? Particularly, are there any existing compute functions that I could
>>> use instead?
>>>
>>> Here's a concrete example, with rows consisting of variable-length lists
>>> of strings:
>>> ["a", "b", "x"]
>>> ["c", "d"]
>>> ["e", "x", "a"]
>>> ["c"]
>>> ["d, "e"]
>>>
>>> If the element to search for is "x", only the first and third row would
>>> be retained after filtering:
>>> ["a", "b", "x"]
>>> ["e", "x", "a"]
>>>
>>> To implement this, the following should work, but is there a better way?
>>>
>>> (1) Run the "equal" compute function on the values of the list:
>>> [false, false, true, false, false, false, true, false, false, false,
>>> false]
>>>
>>> (2) Linearly scan the result of (1) in lockstep with the list's offsets,
>>> to keep track of which rows matched:
>>> [true, false, true, false, false]
>>>
>>> (3) Expand the result of (2) by the list lengths:
>>> [true, true, true, false, false, true, true, true, false, false, false]
>>>
>>> (4) Use the "filter" compute function (using the result from (3)) to
>>> copy only the matching values.
>>> ["a", "b", "x", "e", "x", "a"]
>>>
>>> (5) Using the result of (2), sum up lengths to compute new offsets:
>>> [0, 3, 6]
>>>
>>> (2), (3), and (5) are of course not difficult to implement, but is there
>>> maybe a trick to use existing compute functions instead? Particularly for
>>> non-C++ implementations that could make a big performance difference.
>>>
>>> Cheers,
>>> Leo
>>>
>>>>

Reply via email to