Hi Niranda,

Thanks a lot for the quick response! Yes, you're absolutely right, that
should work! I somehow missed that the filter compute function takes _any_
input type.

So that only leaves (2)... Using (1) to filter the result of
list_parent_indices gets pretty close:
filter([0, 0, 0, 1, 1, 2, 2, 2, 3, 4, 4], [false, false, true, false,
false, false, true, false, false, false, false]) = [0, 2]

Is there a shortcut in Arrow to transform a list of indices to a
corresponding boolean mask ([0, 2] --> [true, false, true, false, false])?

Cheers,
Leo

On Fri, Aug 13, 2021 at 10:45 PM Niranda Perera <[email protected]>
wrote:

> Hi Leo,
>
> Can't you call compute.filter with the resultant bool array from step (2)?
> 🤔
>
> On Fri, Aug 13, 2021, 06:17 Leonhard Gruenschloss <
> [email protected]> wrote:
>
>> Hi,
>>
>> I'd like to filter a ListArray, based on whether a particular value is
>> present in each list. Is there a better approach than the one described
>> below? Particularly, are there any existing compute functions that I could
>> use instead?
>>
>> Here's a concrete example, with rows consisting of variable-length lists
>> of strings:
>> ["a", "b", "x"]
>> ["c", "d"]
>> ["e", "x", "a"]
>> ["c"]
>> ["d, "e"]
>>
>> If the element to search for is "x", only the first and third row would
>> be retained after filtering:
>> ["a", "b", "x"]
>> ["e", "x", "a"]
>>
>> To implement this, the following should work, but is there a better way?
>>
>> (1) Run the "equal" compute function on the values of the list:
>> [false, false, true, false, false, false, true, false, false, false,
>> false]
>>
>> (2) Linearly scan the result of (1) in lockstep with the list's offsets,
>> to keep track of which rows matched:
>> [true, false, true, false, false]
>>
>> (3) Expand the result of (2) by the list lengths:
>> [true, true, true, false, false, true, true, true, false, false, false]
>>
>> (4) Use the "filter" compute function (using the result from (3)) to copy
>> only the matching values.
>> ["a", "b", "x", "e", "x", "a"]
>>
>> (5) Using the result of (2), sum up lengths to compute new offsets:
>> [0, 3, 6]
>>
>> (2), (3), and (5) are of course not difficult to implement, but is there
>> maybe a trick to use existing compute functions instead? Particularly for
>> non-C++ implementations that could make a big performance difference.
>>
>> Cheers,
>> Leo
>>
>>>

Reply via email to