Re: [Python] Why is the access to values in ChunkedArray O(n_chunks) ?

Quentin Lhoest Tue, 16 Mar 2021 11:30:45 -0700

I just created https://issues.apache.org/jira/browse/ARROW-11989


> On Mar 16, 2021, at 6:54 PM, Wes McKinney <[email protected]> wrote:
> 
> Is there a Jira tracking this performance improvement? At minimum
> getting to O(log k) indexing time where k is the number of chunks
> would be a good goal
> 
> On Mon, Mar 15, 2021 at 8:05 PM Micah Kornfield <[email protected]> wrote:
>> 
>> One more micro optimization would be to use interpolation search instead of 
>> binary search (haven't checked if this is what the compute code does)
>> 
>> On Monday, March 15, 2021, Weston Pace <[email protected]> wrote:
>>> 
>>> Since most of the time is probably spent loading each length from RAM
>>> (the lengths aren't contiguous and given the size of the chunks they
>>> are probably pretty far apart) you can even get significant speedup
>>> just by using a contiguous array of lengths in your index.  Note, this
>>> can be combined with Antoine's suggestion.  I did a quick
>>> micro-benchmark and by the time I got to 32 chunks it was already 5x
>>> slower.  With a contiguous array of lengths it was still within 20% of
>>> the original time.
>>> 
>>> On Mon, Mar 15, 2021 at 8:14 AM Antoine Pitrou <[email protected]> wrote:
>>>> 
>>>> On Mon, 15 Mar 2021 11:02:09 -0700
>>>> Micah Kornfield <[email protected]> wrote:
>>>>>> 
>>>>>> Do you know if the iteration is done on the python side, or on the C++
>>>>>> side ?
>>>>> 
>>>>> It appears to be in cython [1] which looking at the definition, I would
>>>>> expect to compile down to pretty straightforward C code.
>>>>> 
>>>>> May I create a post on JIRA about adding an indexing structure for
>>>>>> ChunkedArray ?
>>>>> 
>>>>> Yes, please do, I'm sure others will have thoughts on the best way to
>>>>> incorporate this (it might also be a good first contribution to the 
>>>>> project
>>>>> if you are interested).  I think also providing some context from this
>>>>> thread on the relative slowness would be good (the bottleneck still might
>>>>> be something else, that others more familiar with the code could point 
>>>>> to).
>>>> 
>>>> Just for the record, there is already something like this in the
>>>> compute layer.  The general idea can probably be reused.
>>>> 
>>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94
>>>> 
>>>> Regards
>>>> 
>>>> Antoine.
>>>> 
>>>>

Re: [Python] Why is the access to values in ChunkedArray O(n_chunks) ?

Reply via email to