Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-06 Thread Maheshakya Wijewardena
I did a rough implementation, setting b = min_hash_match. The result I got from running the benchmark is attached. It was able to roughly triple the speed-up of kneighbors function for large index sizes. However, this implementation adds some overhead to fitting time as there are 2**b *

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-06 Thread Joel Nothman
It's nice to see some decent speed-up factors, though the accuracy tradeoff is still not so great. Still, I'd like to see the code and where we can go from here. Great work so far! On 7 August 2015 at 07:50, Maheshakya Wijewardena pmaheshak...@gmail.com wrote: I did a rough implementation,

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-02 Thread Joel Nothman
Thanks, I look forward to this being improved, while I have little availability to help myself atm. On 2 August 2015 at 22:58, Maheshakya Wijewardena pmaheshak...@gmail.com wrote: I agree with Joel. Profiling indicated that 69.8% of total time of kneighbors is spent on _find_matching_indices

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-02 Thread Maheshakya Wijewardena
I agree with Joel. Profiling indicated that 69.8% of total time of kneighbors is spent on _find_matching_indices and 22.9% is spent on _compute_distances. So I'll give priority to work on _find_matching_indices with the method you suggested. On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-01 Thread Maheshakya Wijewardena
Hi Joel, I was on vacation during past 3 days. I''ll look into this asap and let you all know. I also did some profiling, but only with the usage of `pairwise_distance` method. Brute force technique directly uses that for the entire query array, but LSH uses that in a loop and I noticed there is

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-01 Thread Joel Nothman
@Maheshakya, will you be able to do work in the near future on speeding up the ascending phase instead? Or should another contributor take that baton? Not only does it seem to be a major contributor to runtime, but it is independent of metric and hashing mechanism (within binary hashes), and hence

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Gael Varoquaux
Hi, This is absolutely great news. Thanks a lot. Please do open a WIP PR. We (at INRIA) were planning to allocate time from someone this summer to work on this. So you'll have someone reviewing / advicing. With regards to releasing the gil, you need to use the 'with nogil' statement in cython.

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
What makes you think this is the main bottleneck? While it is not an insignificant consumer of time, I really doubt this is what's making scikit-learn's LSH implementation severely underperform with respect to other implementations. We need to profile. In order to do that, we need some sensible

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
One approach to fixing the ascending phase would ensure that _find_matching_indices is only searching over parts of the tree that have not yet been explored, while currently it searches over the entire index at each depth. My preferred, but more experimental, solution is to memoize where the

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
(sorry, I should have said the first b layers, not 2**b layers, producing a memoization of 2**b offsets) On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote: One approach to fixing the ascending phase would ensure that _find_matching_indices is only searching over parts of the

[Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Maheshakya Wijewardena
Hi, I've started to look into the matter of improving performance of LSHForest. As we have discussed sometime before(in fact, quite a long time), main concern is to Cythonize distance calculations. Currently, this done by iteratively moving over all the query vectors when `kneighbors` method is