Is there a Jira tracking this performance improvement? At minimum getting to O(log k) indexing time where k is the number of chunks would be a good goal
On Mon, Mar 15, 2021 at 8:05 PM Micah Kornfield <[email protected]> wrote: > > One more micro optimization would be to use interpolation search instead of > binary search (haven't checked if this is what the compute code does) > > On Monday, March 15, 2021, Weston Pace <[email protected]> wrote: >> >> Since most of the time is probably spent loading each length from RAM >> (the lengths aren't contiguous and given the size of the chunks they >> are probably pretty far apart) you can even get significant speedup >> just by using a contiguous array of lengths in your index. Note, this >> can be combined with Antoine's suggestion. I did a quick >> micro-benchmark and by the time I got to 32 chunks it was already 5x >> slower. With a contiguous array of lengths it was still within 20% of >> the original time. >> >> On Mon, Mar 15, 2021 at 8:14 AM Antoine Pitrou <[email protected]> wrote: >> > >> > On Mon, 15 Mar 2021 11:02:09 -0700 >> > Micah Kornfield <[email protected]> wrote: >> > > > >> > > > Do you know if the iteration is done on the python side, or on the C++ >> > > > side ? >> > > >> > > It appears to be in cython [1] which looking at the definition, I would >> > > expect to compile down to pretty straightforward C code. >> > > >> > > May I create a post on JIRA about adding an indexing structure for >> > > > ChunkedArray ? >> > > >> > > Yes, please do, I'm sure others will have thoughts on the best way to >> > > incorporate this (it might also be a good first contribution to the >> > > project >> > > if you are interested). I think also providing some context from this >> > > thread on the relative slowness would be good (the bottleneck still might >> > > be something else, that others more familiar with the code could point >> > > to). >> > >> > Just for the record, there is already something like this in the >> > compute layer. The general idea can probably be reused. >> > >> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94 >> > >> > Regards >> > >> > Antoine. >> > >> >
