Hello mailing list,

I would like to open a PR that improves the speed of intersect1d in particular 
for large arrays of potentially unequal sizes. As parts of the change would 
involve a new keyword I think I should write to this mailing list first 
(correct me if I am wrong).


Lets start with the current implementation of intersect1d. The steps are 
(roughly): I) concatenate both arrays II) sort that large array and III) keep 
only duplicates. This is a simple but not necessary fast implementation. In 
particular it has has complexity O( (len1 + len2) log(len1 + len2)).


I have two suggestions one of which can greatly improve the runtime (1) while 
the other has some minor improvements (2). Lets start with the major one.

1) Adding assume_sorted as keyword argument.
    If you can already assume that the arrays are sorted you can improve the 
runtime to min(len1, len2)*log(max(len1, len2)) which is orders of magnitude 
faster especially when the arrays intersected are unequal in size.
    The implementation basically is I) figure out which is the smaller array 
II) make that array unique III) use np.searchsorted to find duplicates of the 
smaller in the larger.

    I think that having assume_sorted as a keyword argument is quite useful, 
especially as the result of assume_sorted is always sorted. So chaining 
multiple intersections together can be made much faster.


2) Improve the performance in the case that we cannot assume sorted arrays

     In the case that we cannot assume that both arrays are already sorted we 
can still gain advantages. Mainly by sorting the arrays separately and then 
using np.searchsorted again. This brings the complexity down to O( len1 
log(len1)  + len2 log(len2)) + min(len1, len2)*log(max(len1, len2)).


I already mention this in the issue 16964: 
https://github.com/numpy/numpy/issues/16964


Looking forward for your feedback.


Have a great day and stay safe


Felix Stamm

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion

Reply via email to