Please remove me from this email list

On Thu, Sep 30, 2021, 11:45 AM Aldrin <[email protected]> wrote:
> Hello!
>
> I am curious what the recommended approach is for processing values in
> arrow arrays/tables? Is a simple for loop fine most of the time, or is the
> arrow::Iterator recommended, or is it better to somehow define a compute
> function and let arrow do all of the iteration itself? I provide context
> below for what I'm trying to do, and hopefully it makes clear why I'm
> asking this question, and what it is I'm asking.
>
> For reference, I am trying to remove rows based on duplicates in a
> particular column. There doesn't seem to be a compute function that already
> does this, and I can't think of a way to compose existing functions to get
> what I need. I can think of a simple approach I can implement, and an
> approach that requires a slight modification of an existing compute
> function.
>
> The simple approach I can think of is to: (1) take the column of interest,
> (2) iterate over it and note the indices of values to drop, and (3) use
> Table::Slice and arrow::ConcatenateTables to produce a result Table. This
> feels like I'm missing out on some things that arrow may provide at least
> for the 2nd step.
>
> The better approach to step (2) above, would be to use
> arrow::compute::Unique, but instead of producing unique values, produce
> indexes. This way I could perhaps also setup a function Options that could
> choose to keep the first duplicate, or keep the last, etc.
>
> My C++ is not particularly advanced, so I find it hard to know where to
> start for adapting an existing compute function (also, it is very hard to
> search for the unique function because of "unique_ptr").
>
> Thanks for any help and advice!
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
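
For what it's worth, step (2) of the quoted message (finding which row indices survive de-duplication, with a choice of keeping the first or the last duplicate) might look roughly like the sketch below. This is not Arrow code: a std::vector stands in for the column of interest, and the names UniqueIndices and Keep are made up for illustration. It only shows the index bookkeeping, using a hash map from value to the row index to keep.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical option: which occurrence of a duplicated value to keep.
enum class Keep { First, Last };

// Hypothetical helper (not part of Arrow): returns the sorted row indices
// that remain after dropping duplicate values from `column`.
template <typename T>
std::vector<int64_t> UniqueIndices(const std::vector<T>& column,
                                   Keep keep = Keep::First) {
  // Maps each distinct value to the index of the row we intend to keep.
  std::unordered_map<T, int64_t> chosen;
  for (int64_t i = 0; i < static_cast<int64_t>(column.size()); ++i) {
    // emplace leaves an existing entry untouched, which is what Keep::First
    // needs; for Keep::Last, overwrite with the later index.
    auto result = chosen.emplace(column[i], i);
    if (!result.second && keep == Keep::Last) {
      result.first->second = i;
    }
  }

  // Collect the kept indices and restore the original row order.
  std::vector<int64_t> indices;
  indices.reserve(chosen.size());
  for (const auto& entry : chosen) {
    indices.push_back(entry.second);
  }
  std::sort(indices.begin(), indices.end());
  return indices;
}
```

For a column {"a", "b", "a", "c", "b"} this gives indices 0, 1, 3 with Keep::First and 2, 3, 4 with Keep::Last. With real Arrow data, the resulting indices could presumably be put into an Int64Array and handed to arrow::compute::Take on the table instead of slicing and concatenating, though that part is an assumption and worth checking against the compute documentation.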
