Hi Andrew,

Although Arrow doesn't support such indirections natively, it might still be possible to use Arrow to store your data. One possible approach is to store all points in a Table or a RecordBatch [1], then keep an Int64 index array for the indirection.
1. You can always access the "real" data through its index in O(1) time. However, since Arrow is a columnar storage format, you need to access each field separately.
2. You can use the compute function Take [2] to materialize the indexed subset, for example when exporting data.
3. Sorting is a bit tricky. You probably need to implement your own sorting function.

Be aware that I'm talking about the C++/Python implementation. It may not be applicable in other languages.

[1] https://arrow.apache.org/docs/cpp/tables.html
[2] https://arrow.apache.org/docs/cpp/compute.html#selections

On Thu, Sep 14, 2023 at 8:42 PM Weston Pace <[email protected]> wrote:

> I'm not entirely sure what kinds of operations you need.
>
> Arrow arrays all (with the exception of RLE) support constant time (O(1))
> random access. So generally if you want to keep a pointer to a particular
> element or a row of data then that is ok.
>
> On the other hand, you mentioned sorting. One thing that is a little
> challenging in arrow is swapping two rows of data. It's very possible, and
> still the same algorithmic complexity (O(# columns)) as a row based format
> but it is not as memory efficient. Because you are doing a separate memory
> swap for each array.
>
> This is why arrow compute libraries will sometimes convert to a row based
> format for certain operations.
>
> On Thu, Sep 14, 2023, 8:21 AM Andrew Bell <[email protected]>
> wrote:
>
>> Hi,
>>
>> We have a data structure that stores points in a point cloud (X, Y, Z,
>> attributes) and we have been approached about replacing the current memory
>> store with Arrow. The issue is that the current data store also has a set
>> of pointers (indirection) that allows for things like subsetting and
>> sorting while keeping the data in place. All data is accessed through the
>> indirection table. What people typically want is to export one or more of
>> these data sets specified by the pointers.
>>
>> My understanding is that Arrow doesn't support such a scheme as the point
>> of the structure is to allow SIMD and other optimizations gained by
>> processing contiguous data. Am I missing something in my reading of the
>> Arrow docs? Does anyone have thoughts/recommendations, or is Arrow just not
>> a good fit for this kind of thing?
>>
>> Thanks,
>>
>> --
>> Andrew Bell
>> [email protected]
