I just created https://issues.apache.org/jira/browse/ARROW-11989
> On Mar 16, 2021, at 6:54 PM, Wes McKinney <[email protected]> wrote: > > Is there a Jira tracking this performance improvement? At minimum > getting to O(log k) indexing time where k is the number of chunks > would be a good goal > > On Mon, Mar 15, 2021 at 8:05 PM Micah Kornfield <[email protected]> wrote: >> >> One more micro optimization would be to use interpolation search instead of >> binary search (haven't checked if this is what the compute code does) >> >> On Monday, March 15, 2021, Weston Pace <[email protected]> wrote: >>> >>> Since most of the time is probably spent loading each length from RAM >>> (the lengths aren't contiguous and given the size of the chunks they >>> are probably pretty far apart) you can even get significant speedup >>> just by using a contiguous array of lengths in your index. Note, this >>> can be combined with Antoine's suggestion. I did a quick >>> micro-benchmark and by the time I got to 32 chunks it was already 5x >>> slower. With a contiguous array of lengths it was still within 20% of >>> the original time. >>> >>> On Mon, Mar 15, 2021 at 8:14 AM Antoine Pitrou <[email protected]> wrote: >>>> >>>> On Mon, 15 Mar 2021 11:02:09 -0700 >>>> Micah Kornfield <[email protected]> wrote: >>>>>> >>>>>> Do you know if the iteration is done on the python side, or on the C++ >>>>>> side ? >>>>> >>>>> It appears to be in cython [1] which looking at the definition, I would >>>>> expect to compile down to pretty straightforward C code. >>>>> >>>>> May I create a post on JIRA about adding an indexing structure for >>>>>> ChunkedArray ? >>>>> >>>>> Yes, please do, I'm sure others will have thoughts on the best way to >>>>> incorporate this (it might also be a good first contribution to the >>>>> project >>>>> if you are interested). I think also providing some context from this >>>>> thread on the relative slowness would be good (the bottleneck still might >>>>> be something else, that others more familiar with the code could point >>>>> to). >>>> >>>> Just for the record, there is already something like this in the >>>> compute layer. The general idea can probably be reused. >>>> >>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94 >>>> >>>> Regards >>>> >>>> Antoine. >>>> >>>>
