Thanks for the thoughts. Changes in CPU design and performance may mean that what seemed an important consideration some years ago is no longer that valuable. With SIMD everywhere and RISC-V vectorization becoming a real thing, I may try ditching the indirection and using Arrow directly. This potentially means making a bunch of copies, but perhaps the processing will be enough faster in the important cases that it won't matter. It's a game of tradeoffs.

On Thu, Sep 14, 2023 at 10:50 AM Jin Shang <[email protected]> wrote:

> Hi Andrew,
>
> Although Arrow doesn't support such indirections natively, it might still
> be possible to use Arrow to store your data. One possible approach is to
> store all points in a Table or a RecordBatch [1], then keep an Int64 index
> array for the indirection.
>
> 1. You can always access the "real" data with its index in O(1) time.
> However, since arrow is a column storage format, you need to access each
> field separately.
> 2. You can use the compute function Take[2] to materialize the indexed
> subset, for example when exporting data.
> 3. Sorting is a bit tricky. You probably need to implement your own
> sorting function.
>
> Be aware that I'm talking about the C++/Python implementation. It may not
> be applicable in other languages.
>
> [1] https://arrow.apache.org/docs/cpp/tables.html
> [2] https://arrow.apache.org/docs/cpp/compute.html#selections
>
> On Thu, Sep 14, 2023 at 8:42 PM Weston Pace <[email protected]> wrote:
>
>> I'm not entirely sure what kinds of operations you need.
>>
>> Arrow arrays all (with the exception of RLE) support constant time (O(1))
>> random access. So generally if you want to keep a pointer to a particular
>> element or a row of data then that is ok.
>>
>> On the other hand, you mentioned sorting. One thing that is a little
>> challenging in arrow is swapping two rows of data. It's very possible, and
>> still the same algorithmic complexity (O(# columns)) as a row based format
>> but it is not as memory efficient. Because you are doing a separate memory
>> swap for each array.
>>
>> This is why arrow compute libraries will sometimes convert to a row based
>> format for certain operations.
>>
>> On Thu, Sep 14, 2023, 8:21 AM Andrew Bell <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> We have a data structure that stores points in a point cloud (X, Y, Z,
>>> attributes) and we have been approached about replacing the current memory
>>> store with Arrow. The issue is that the current data store also has a set
>>> of pointers (indirection) that allows for things like subsetting and
>>> sorting while keeping the data in place. All data is accessed through the
>>> indirection table. What people typically want is to export one or more of
>>> these data sets specified by the pointers.
>>>
>>> My understanding is that Arrow doesn't support such a scheme as the
>>> point of the structure is to allow SIMD and other optimizations gained by
>>> processing contiguous data. Am I missing something in my reading of the
>>> Arrow docs? Does anyone have thoughts/recommendations, or is Arrow just not
>>> a good fit for this kind of thing?
>>>
>>> Thanks,
>>>
>>> --
>>> Andrew Bell
>>> [email protected]
>>>
>>

--
Andrew Bell
[email protected]
