Hi Andrew,

Although Arrow doesn't support such indirections natively, it might still
be possible to use Arrow to store your data. One possible approach is to
store all points in a Table or a RecordBatch [1], then keep an Int64 index
array for the indirection.

1. You can always access the "real" data by its index in O(1) time.
However, since Arrow is a columnar format, you need to access each
field (column) separately.
2. You can use the compute function Take [2] to materialize the indexed
subset, for example when exporting data.
3. Sorting is a bit tricky. You probably need to implement your own sorting
function that reorders the index array rather than the data itself.

Be aware that I'm talking about the C++/Python implementations. This may
not be applicable to the implementations in other languages.

[1] https://arrow.apache.org/docs/cpp/tables.html
[2] https://arrow.apache.org/docs/cpp/compute.html#selections

On Thu, Sep 14, 2023 at 8:42 PM Weston Pace <[email protected]> wrote:

> I'm not entirely sure what kinds of operations you need.
>
> Arrow arrays all (with the exception of RLE) support constant time (O(1))
> random access.  So generally if you want to keep a pointer to a particular
> element or a row of data then that is ok.
>
> On the other hand, you mentioned sorting.  One thing that is a little
> challenging in arrow is swapping two rows of data.  It's very possible, and
> still the same algorithmic complexity (O(# columns)) as a row based format
> but it is not as memory efficient, because you are doing a separate memory
> swap for each array.
>
> This is why arrow compute libraries will sometimes convert to a row based
> format for certain operations.
>
> On Thu, Sep 14, 2023, 8:21 AM Andrew Bell <[email protected]>
> wrote:
>
>> Hi,
>>
>> We have a data structure that stores points in a point cloud (X, Y, Z,
>> attributes) and we have been approached about replacing the current memory
>> store with Arrow. The issue is that the current data store also has a set
>> of pointers (indirection) that allows for things like subsetting and
>> sorting while keeping the data in place. All data is accessed through the
>> indirection table. What people typically want is to export one or more of
>> these data sets specified by the pointers.
>>
>> My understanding is that Arrow doesn't support such a scheme as the point
>> of the structure is to allow SIMD and other optimizations gained by
>> processing contiguous data. Am I missing something in my reading of the
>> Arrow docs? Does anyone have thoughts/recommendations, or is Arrow just not
>> a good fit for this kind of thing?
>>
>> Thanks,
>>
>> --
>> Andrew Bell
>> [email protected]
>>
>
