Hello!

I am curious what the recommended approach is for processing values in Arrow arrays/tables. Is a simple for loop fine most of the time, is arrow::Iterator recommended, or is it better to define a compute function and let Arrow do all of the iteration itself? I provide context below for what I'm trying to do, which hopefully makes clear why I'm asking and what exactly I'm asking.

For reference, I am trying to remove rows based on duplicates in a particular column. There doesn't seem to be a compute function that already does this, and I can't think of a way to compose existing functions to get what I need. I can think of two approaches: a simple one I can implement myself, and one that requires a slight modification of an existing compute function.

The simple approach is to:

(1) take the column of interest,
(2) iterate over it and note the indices of values to drop, and
(3) use Table::Slice and arrow::ConcatenateTables to produce a result Table.

This feels like I'm missing out on some things that Arrow may provide, at least for step (2).

The better approach to step (2) would be to use something like arrow::compute::Unique that produces indices instead of unique values. This way I could perhaps also set up a function Options that chooses whether to keep the first duplicate, keep the last, etc.

My C++ is not particularly advanced, so I find it hard to know where to start for adapting an existing compute function (also, it is very hard to search for the unique function because of "unique_ptr").

Thanks for any help and advice!

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
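
P.S. To make step (2) concrete, here is a minimal sketch of the index computation I have in mind, written in plain C++ with no Arrow types. A std::vector stands in for the column, and the names KeepIndices and Keep are hypothetical, made up for illustration and not part of Arrow. It collects the row indices to keep rather than the ones to drop, and it assumes the value type is hashable with std::hash.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical option: which occurrence of a duplicated value survives.
enum class Keep { kFirst, kLast };

// Hypothetical helper (not an Arrow API): returns the sorted row indices
// to keep so that each distinct value in `column` appears exactly once.
template <typename T>
std::vector<int64_t> KeepIndices(const std::vector<T>& column, Keep keep) {
  // Maps each distinct value to the row index chosen for it.
  std::unordered_map<T, int64_t> chosen;
  chosen.reserve(column.size());

  const int64_t length = static_cast<int64_t>(column.size());
  for (int64_t i = 0; i < length; ++i) {
    // try_emplace only inserts when the value has not been seen before,
    // so the first occurrence is recorded by default.
    auto [it, inserted] = chosen.try_emplace(column[i], i);
    if (!inserted && keep == Keep::kLast) {
      it->second = i;  // Overwrite with the later occurrence.
    }
  }

  std::vector<int64_t> indices;
  indices.reserve(chosen.size());
  for (const auto& entry : chosen) {
    indices.push_back(entry.second);
  }
  // Hash map order is unspecified; sort to restore original row order.
  std::sort(indices.begin(), indices.end());
  return indices;
}
```

For the column {"a", "b", "a", "c", "b"}, this gives {0, 1, 3} with Keep::kFirst and {2, 3, 4} with Keep::kLast. My guess is that indices like these could then be handed to something like arrow::compute::Take instead of slicing and concatenating, but I have not tried that.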
