Re: Arrow with indirection?

Andrew Bell Thu, 14 Sep 2023 09:48:31 -0700

Thanks for the thoughts. Advances in CPU design and performance may have made
what seemed an important consideration some years ago less valuable today.
With SIMD everywhere and RISC-V vectorization becoming a real thing, I may
try ditching the indirection and using Arrow directly.  This potentially
means making a bunch of copies, but perhaps the processing will be enough
faster in the important cases that the copies won't matter.  This is a game
of tradeoffs.


On Thu, Sep 14, 2023 at 10:50 AM Jin Shang <[email protected]> wrote:

> Hi Andrew,
>
> Although Arrow doesn't support such indirections natively, it might still
> be possible to use Arrow to store your data. One possible approach is to
> store all points in a Table or a RecordBatch [1], then keep an Int64 index
> array for the indirection.
>
> 1. You can always access the "real" data with its index in O(1) time.
> However, since Arrow is a columnar storage format, you need to access each
> field separately.
> 2. You can use the compute function Take [2] to materialize the indexed
> subset, for example when exporting data.
> 3. Sorting is a bit tricky. You probably need to implement your own
> sorting function.
>
> Be aware that I'm talking about the C++/Python implementation. It may not
> be applicable in other languages.
>
> [1] https://arrow.apache.org/docs/cpp/tables.html
> [2] https://arrow.apache.org/docs/cpp/compute.html#selections
>
> On Thu, Sep 14, 2023 at 8:42 PM Weston Pace <[email protected]> wrote:
>
>> I'm not entirely sure what kinds of operations you need.
>>
>> Arrow arrays all (with the exception of RLE) support constant time (O(1))
>> random access.  So generally if you want to keep a pointer to a particular
>> element or a row of data then that is ok.
>>
>> On the other hand, you mentioned sorting.  One thing that is a little
>> challenging in Arrow is swapping two rows of data.  It's very possible, and
>> still the same algorithmic complexity (O(# columns)) as a row-based format,
>> but it is not as memory efficient, because you are doing a separate memory
>> swap for each array.
>>
>> This is why Arrow compute libraries will sometimes convert to a row-based
>> format for certain operations.
>>
>> On Thu, Sep 14, 2023, 8:21 AM Andrew Bell <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> We have a data structure that stores points in a point cloud (X, Y, Z,
>>> attributes) and we have been approached about replacing the current memory
>>> store with Arrow. The issue is that the current data store also has a set
>>> of pointers (indirection) that allows for things like subsetting and
>>> sorting while keeping the data in place. All data is accessed through the
>>> indirection table. What people typically want is to export one or more of
>>> these data sets specified by the pointers.
>>>
>>> My understanding is that Arrow doesn't support such a scheme as the
>>> point of the structure is to allow SIMD and other optimizations gained by
>>> processing contiguous data. Am I missing something in my reading of the
>>> Arrow docs? Does anyone have thoughts/recommendations, or is Arrow just not
>>> a good fit for this kind of thing?
>>>
>>> Thanks,
>>>
>>> --
>>> Andrew Bell
>>> [email protected]
>>>
>>

-- 
Andrew Bell
[email protected]
