Re: Best practice for data iteration over arrays or tabular data

Burke Kaltenberger Thu, 30 Sep 2021 14:48:42 -0700

Please remove me from this email list

On Thu, Sep 30, 2021, 11:45 AM Aldrin <[email protected]> wrote:


> Hello!
>
> I am curious what the recommended approach is for processing values in
> arrow arrays/tables? Is a simple for loop fine most of the time, or is the
> arrow::Iterator recommended, or is it better to somehow define a compute
> function and let arrow do all of the iteration itself? I provide context
> below for what I'm trying to do, and hopefully it makes clear why I'm
> asking this question, and what it is I'm asking.
>
> For reference, I am trying to remove rows based on duplicates in a
> particular column. There doesn't seem to be a compute function that already
> does this, and I can't think of a way to compose existing functions to get
> what I need. I can think of a simple approach I can implement, and an
> approach that requires a slight modification of an existing compute
> function.
>
> The simple approach I can think of is to: (1) take the column of interest,
> (2) iterate over it and note the indices of values to drop, and (3) use
> Table::Slice and arrow::ConcatenateTables to produce a result Table. This
> feels like I'm missing out on some things that arrow may provide at least
> for the 2nd step.
>
> A better approach to step (2) above would be to use
> arrow::compute::Unique, but have it produce indices instead of unique
> values. This way I could perhaps also set up a function Options that could
> choose to keep the first duplicate, or keep the last, etc.
>
> My C++ is not particularly advanced, so I find it hard to know where to
> start for adapting an existing compute function (also, it is very hard to
> search for the unique function because of "unique_ptr").
>
> Thanks for any help and advice!
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
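
For readers of the archive, step (2) of the simple approach described above can be sketched in plain C++. This is only an illustration under stated assumptions: the column is modelled as a std::vector rather than an Arrow array, and the names Keep and DistinctIndices are hypothetical, not part of the Arrow API. It records, per distinct value, the index of the row to keep, with a keep-first or keep-last choice like the Options idea in the message.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical option mirroring the "keep the first duplicate, or keep the
// last" idea from the message. Not an Arrow type.
enum class Keep { kFirst, kLast };

// Returns, in ascending order, the indices of the rows that survive
// deduplication on `column`. The column is a plain std::vector here
// (an assumption made to keep the sketch self-contained).
template <typename T>
std::vector<int64_t> DistinctIndices(const std::vector<T>& column, Keep keep) {
  std::unordered_map<T, int64_t> chosen;  // value -> index of the row to keep
  chosen.reserve(column.size());

  for (int64_t i = 0; i < static_cast<int64_t>(column.size()); ++i) {
    // emplace inserts only if the value has not been seen before.
    auto result = chosen.emplace(column[i], i);
    if (!result.second && keep == Keep::kLast) {
      result.first->second = i;  // a later duplicate replaces the earlier one
    }
  }

  std::vector<int64_t> indices;
  indices.reserve(chosen.size());
  for (const auto& entry : chosen) {
    indices.push_back(entry.second);
  }
  // Hash-map iteration order is unspecified, so restore row order.
  std::sort(indices.begin(), indices.end());
  return indices;
}
```

For the column {"a", "b", "a", "c", "b"}, keep-first yields indices 0, 1, 3 and keep-last yields 2, 3, 4. With Arrow itself, a list of kept indices like this could probably be handed to arrow::compute::Take to build the result table in one call, instead of combining Table::Slice and arrow::ConcatenateTables; that pairing is a suggestion here, not something confirmed in the thread.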
