Hello!

I am curious what the recommended approach is for processing values in
Arrow arrays/tables. Is a simple for loop fine most of the time, is
arrow::Iterator recommended, or is it better to define a compute function
and let Arrow do all of the iteration itself? I provide context below for
what I'm trying to do, which hopefully makes clear why I'm asking this
question and what exactly I'm asking.

For reference, I am trying to remove rows based on duplicates in a
particular column. There doesn't seem to be a compute function that already
does this, and I can't think of a way to compose existing functions to get
what I need. I can think of a simple approach I can implement, and an
approach that requires a slight modification of an existing compute
function.

The simple approach I can think of is to: (1) take the column of interest,
(2) iterate over it and note the indices of values to drop, and (3) use
Table::Slice and arrow::ConcatenateTables to produce a result Table. I
suspect I'm missing out on something that Arrow already provides, at least
for step (2).
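
To make steps (2) and (3) concrete, here is a minimal sketch using only the
standard library. It assumes the column values have already been copied
into a std::vector<int64_t>; the helper name KeepRanges is hypothetical and
is not part of Arrow. It keeps the first occurrence of each value and
returns half-open [start, end) row ranges that survive.

```cpp
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical helper (not an Arrow API): scan the column once and return
// half-open [start, end) row ranges to keep. A row is dropped when its
// value has already appeared earlier in the column.
std::vector<std::pair<int64_t, int64_t>> KeepRanges(
    const std::vector<int64_t>& column) {
  std::unordered_set<int64_t> seen;
  std::vector<std::pair<int64_t, int64_t>> ranges;
  const int64_t n = static_cast<int64_t>(column.size());
  int64_t run_start = -1;  // -1 means no run of kept rows is open

  for (int64_t i = 0; i < n; ++i) {
    const bool first_occurrence = seen.insert(column[i]).second;
    if (first_occurrence) {
      if (run_start < 0) run_start = i;   // open a new run of kept rows
    } else if (run_start >= 0) {
      ranges.emplace_back(run_start, i);  // a duplicate closes the run
      run_start = -1;
    }
  }
  if (run_start >= 0) ranges.emplace_back(run_start, n);
  return ranges;
}
```

Each range would then correspond to one Table::Slice(start, end - start)
call, with the resulting slices passed to arrow::ConcatenateTables.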

A better approach to step (2) above would be to use something like
arrow::compute::Unique that produces indices instead of unique values.
That way I could perhaps also set up function options to choose whether to
keep the first duplicate, keep the last, etc.
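
As a rough illustration of the behavior described here, the sketch below
returns row indices rather than values, with an option for which duplicate
to keep. It uses only the standard library and again assumes the column is
a std::vector<int64_t>; the names Keep and UniqueIndices are hypothetical
and do not exist in Arrow.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical option (not an Arrow type): which duplicate row to keep.
enum class Keep { kFirst, kLast };

// Hypothetical helper (not an Arrow API): return the sorted row indices
// of one representative row per distinct value in the column.
std::vector<int64_t> UniqueIndices(const std::vector<int64_t>& column,
                                   Keep keep) {
  std::unordered_map<int64_t, int64_t> chosen;  // value -> kept row index
  const int64_t n = static_cast<int64_t>(column.size());
  for (int64_t i = 0; i < n; ++i) {
    auto result = chosen.emplace(column[i], i);
    // emplace leaves an existing entry untouched, which gives keep-first;
    // for keep-last, overwrite the stored index with the later row.
    if (!result.second && keep == Keep::kLast) result.first->second = i;
  }

  std::vector<int64_t> indices;
  indices.reserve(chosen.size());
  for (const auto& entry : chosen) indices.push_back(entry.second);
  std::sort(indices.begin(), indices.end());  // restore row order
  return indices;
}
```

With indices in hand, a selection kernel such as arrow::compute::Take could
replace the Slice and Concatenate step, assuming the indices are first
converted to an Arrow integer array.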

My C++ is not particularly advanced, so I find it hard to know where to
start for adapting an existing compute function (also, it is very hard to
search for the unique function because of "unique_ptr").

Thanks for any help and advice!

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
