Re: Arrow with indirection?

Jin Shang Thu, 14 Sep 2023 07:52:01 -0700

Hi Andrew,

Although Arrow doesn't support such indirections natively, it might still
be possible to use Arrow to store your data. One possible approach is to
store all points in a Table or a RecordBatch [1], then keep an Int64 index
array for the indirection.


1. You can always access the "real" data by its index in O(1) time.
However, since Arrow is a columnar format, you need to access each
field separately.
2. You can use the compute function Take [2] to materialize the indexed
subset, for example when exporting data.
3. Sorting is a bit tricky. You probably need to implement your own sorting
function.

Be aware that I'm talking about the C++/Python implementations. This may
not be applicable in other languages.

[1] https://arrow.apache.org/docs/cpp/tables.html
[2] https://arrow.apache.org/docs/cpp/compute.html#selections

On Thu, Sep 14, 2023 at 8:42 PM Weston Pace <[email protected]> wrote:

> I'm not entirely sure what kinds of operations you need.
>
> Arrow arrays all (with the exception of RLE) support constant time (O(1))
> random access.  So generally if you want to keep a pointer to a particular
> element or a row of data then that is ok.
>
> On the other hand, you mentioned sorting.  One thing that is a little
> challenging in Arrow is swapping two rows of data.  It's very possible, and
> still the same algorithmic complexity (O(# columns)) as a row based format
> but it is not as memory efficient, because you are doing a separate memory
> swap for each array.
>
> This is why Arrow compute libraries will sometimes convert to a row based
> format for certain operations.
>
> On Thu, Sep 14, 2023, 8:21 AM Andrew Bell <[email protected]>
> wrote:
>
>> Hi,
>>
>> We have a data structure that stores points in a point cloud (X, Y, Z,
>> attributes) and we have been approached about replacing the current memory
>> store with Arrow. The issue is that the current data store also has a set
>> of pointers (indirection) that allows for things like subsetting and
>> sorting while keeping the data in place. All data is accessed through the
>> indirection table. What people typically want is to export one or more of
>> these data sets specified by the pointers.
>>
>> My understanding is that Arrow doesn't support such a scheme as the point
>> of the structure is to allow SIMD and other optimizations gained by
>> processing contiguous data. Am I missing something in my reading of the
>> Arrow docs? Does anyone have thoughts/recommendations, or is Arrow just not
>> a good fit for this kind of thing?
>>
>> Thanks,
>>
>> --
>> Andrew Bell
>> [email protected]
>>
>
