Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient
Thanks, I look forward to this being improved, though I have little availability to help myself at the moment.

On 2 August 2015 at 22:58, Maheshakya Wijewardena pmaheshak...@gmail.com wrote:

I agree with Joel. Profiling indicated that 69.8% of the total time of kneighbors is spent in _find_matching_indices and 22.9% is spent in _compute_distances. So I'll give priority to working on _find_matching_indices with the method you suggested.

On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena pmaheshak...@gmail.com wrote:

Hi Joel, I was on vacation during the past 3 days. I'll look into this ASAP and let you all know. I also did some profiling, but only of the usage of the `pairwise_distances` method. The brute-force technique uses it directly on the entire query array, but LSH calls it in a loop, and I noticed a huge lag there. I'll first confirm your claims. I can start working on this, but I think I'll need reviewing from you or some other contributors as well. I'll do this if it's possible.

On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman joel.noth...@gmail.com wrote:

@Maheshakya, will you be able to work in the near future on speeding up the ascending phase instead? Or should another contributor take that baton? Not only does it seem to be a major contributor to runtime, but it is independent of metric and hashing mechanism (within binary hashes), and hence the most fundamental component of LSHForest.

On 30 July 2015 at 22:28, Joel Nothman joel.noth...@gmail.com wrote:

(Sorry, I should have said the first b layers, not 2**b layers, producing a memoization of 2**b offsets.)

On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote:

One approach to fixing the ascending phase would be to ensure that _find_matching_indices only searches over parts of the tree that have not yet been explored, whereas currently it searches over the entire index at each depth.

My preferred, but more experimental, solution is to memoize where the first 2**b layers of the tree begin and end in the index, for small b. So if our index stored:

[[0, 0, 0, 1, 1, 0, 0, 0],
 [0, 0, 1, 0, 1, 0, 1, 0],
 [0, 0, 1, 0, 1, 0, 1, 0],
 [0, 1, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 1, 1, 0, 0, 0],
 [0, 1, 1, 0, 0, 1, 1, 0],
 [1, 0, 0, 0, 0, 1, 0, 1],
 [1, 0, 0, 1, 0, 1, 0, 1],
 [1, 1, 0, 0, 0, 0, 0, 0],
 [1, 1, 1, 1, 1, 0, 0, 0]]

and b=2, we'd memoize offsets for prefixes of size 2:

[0,  # 00
 3,  # 01
 6,  # 10
 8,  # 11
]

Given a query like [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to leave the first b bits [0, 1] remaining, and look them up in the array just defined to identify a much narrower search space [3, 6) matching that prefix in the overall index. Indeed, given the min_hash_match constraint, not having this sort of thing for b = min_hash_match seems wasteful. This provides us O(1) access to the top layers of the tree when ascending, and makes the searchsorted calls run in log(n / (2 ** b)) time rather than log(n). It is also much more like traditional LSH. However, it complicates the code, as we now have to consider two strategies for descent/ascent.
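To make this concrete, here is a rough numpy sketch of the memoization (the helper names and the packed-integer hash representation are illustrative assumptions, not LSHForest's actual internals):

import numpy as np

def memoize_prefix_offsets(sorted_hashes, hash_size, b):
    # Sketch only: assume each tree's binary hashes are packed into
    # unsigned integers of hash_size bits and stored sorted.
    # The smallest packed hash with prefix p is p followed by zero bits,
    # i.e. p << (hash_size - b); searchsorted finds where each prefix
    # block begins in the sorted array.
    prefixes = np.arange(2 ** b, dtype=sorted_hashes.dtype) << (hash_size - b)
    starts = np.searchsorted(sorted_hashes, prefixes, side='left')
    return np.append(starts, len(sorted_hashes))  # sentinel end offset

def prefix_range(query_hash, offsets, hash_size, b):
    # O(1) lookup of the half-open range [start, stop) of index entries
    # sharing the query's first b bits.
    p = int(query_hash) >> (hash_size - b)
    return offsets[p], offsets[p + 1]

# The 10-point index from the message above, rows packed as 8-bit ints:
index = np.array([0b00011000, 0b00101010, 0b00101010, 0b01000000,
                  0b01011000, 0b01100110, 0b10000101, 0b10010101,
                  0b11000000, 0b11111000], dtype=np.uint8)
offsets = memoize_prefix_offsets(index, hash_size=8, b=2)
print(offsets[:-1])                                  # [0 3 6 8]
print(prefix_range(0b01100000, offsets, hash_size=8, b=2))  # (3, 6)

Subsequent searchsorted calls during the ascent would then operate only on the slice [3, 6) rather than the whole index.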
On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote:

What makes you think this is the main bottleneck? While it is not an insignificant consumer of time, I really doubt this is what's making scikit-learn's LSH implementation severely underperform with respect to other implementations. We need to profile. In order to do that, we need some sensible parameters that users might actually want, e.g. number of features for {dense, sparse} cases, index size, target 10NN precision and recall (selecting corresponding n_estimators and n_candidates). Ideally we'd consider real-world datasets. And of course, these should be sensible for whichever metric we're operating over, and whether we're doing KNN or radius searches.

I don't know if it's realistic, but I've profiled the following bench_plot_approximate_neighbors settings:

Building NearestNeighbors for 10 samples in 100 dimensions
LSHF parameters: n_estimators = 15, n_candidates = 100
Building LSHForest for 10 samples in 100 dimensions
Done in 1.492s
Average time for lshf neighbor queries: 0.005s
Average time for exact neighbor queries: 0.002s
Average Accuracy: 0.88
Speed up: 0.5x

Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query matrix, we have:
- 0.03s spent in _query (hashing and descending)
- 0.91s spent in _compute_distances (exact distance calculation)
- 3.80s remaining in _get_candidates (ascending phase), almost all of which is spent in _find_matching_indices

Even cutting the exact distance calculation to 0s would still not make this faster than the exact approach. Of course, your mileage may vary, but this suggests to me you're barking up the wrong tree (no pun intended).
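For anyone wanting to reproduce this kind of breakdown, a rough sketch of the profiling harness (the dataset sizes are illustrative assumptions, and it assumes a scikit-learn version that still ships sklearn.neighbors.LSHForest):

import cProfile
import pstats

import numpy as np
from sklearn.neighbors import LSHForest  # present in scikit-learn 0.16-0.20

rng = np.random.RandomState(42)
X_index = rng.rand(10000, 100)  # illustrative index size
X_query = rng.rand(1000, 100)   # a 1000-query matrix, as above

lshf = LSHForest(n_estimators=15, n_candidates=100, random_state=42)
lshf.fit(X_index)

# Attribute time to the internal phases: _query (hashing/descending),
# _get_candidates / _find_matching_indices (ascending) and
# _compute_distances (exact re-ranking).
profiler = cProfile.Profile()
profiler.enable()
lshf.kneighbors(X_query, n_neighbors=10)
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)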
On 30 July 2015 at 19:43, Maheshakya Wijewardena pmaheshak...@gmail.com wrote:

Hi, I've started to look into the matter of improving the performance of LSHForest. As we have discussed some time before (in fact, quite a long time ago), the main concern is to Cythonize the distance calculations. Currently, this is done by
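As a rough sketch of the access pattern in question (brute force makes a single vectorized pairwise_distances call over the whole query array, while the LSH candidate pass calls it once per query on a small candidate set; the candidate selection below is a random stand-in for the real hash-derived candidates):

import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.rand(10000, 100)       # indexed points (illustrative sizes)
queries = rng.rand(1000, 100)

# Brute force: one vectorized call over the whole query array.
dist_all = pairwise_distances(queries, X)  # shape (1000, 10000)

# LSH-style refinement: one small call per query, paying the per-call
# Python and validation overhead 1000 times.
n_candidates = 100
for q in queries:
    cand = rng.choice(X.shape[0], n_candidates, replace=False)
    dist_q = pairwise_distances(q.reshape(1, -1), X[cand])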