Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-06 Thread Maheshakya Wijewardena
I did a rough implementation, setting b = min_hash_match. The result I got
from running the benchmark is attached. It was able to roughly triple the
speed-up of the kneighbors function for large index sizes. However, this
implementation adds some overhead to fitting time, as there are 2**b *
n_estimators numpy searchsorted calls during training. But that will most
probably be compensated for as the number of queries grows, since 2**b *
n_estimators is a constant cost.
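
For anyone following along, here is a minimal, hypothetical sketch of what that
fit-time memoization could look like; the function and variable names are mine,
and I assume each tree stores its hashes as a lexicographically sorted integer
array (the actual patch may differ):

import numpy as np

def fit_prefix_offsets(trees, b, n_bits=32):
    # `trees`: one lexicographically sorted integer hash array per estimator.
    # Record where every possible b-bit prefix starts in each array; done
    # naively this is one np.searchsorted call per prefix per tree, i.e. the
    # 2**b * n_estimators fit-time calls mentioned above.
    all_offsets = []
    for hashes in trees:
        prefixes = hashes >> (n_bits - b)  # keep only the b leading bits
        offsets = np.array([np.searchsorted(prefixes, p) for p in range(2 ** b)])
        all_offsets.append(offsets)
    return all_offsets

# Toy usage: 3 "trees" of 1000 random 32-bit hashes each, b = 4.
rng = np.random.RandomState(0)
trees = [np.sort(rng.randint(0, 2 ** 32, size=1000, dtype=np.uint64))
         for _ in range(3)]
offsets_per_tree = fit_prefix_offsets(trees, b=4)  # 2**4 * 3 searchsorted calls

The offsets are computed once per fit, so the cost is amortised across all
subsequent queries.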

I'll send a PR with proper refactoring.


On Sun, Aug 2, 2015 at 6:41 PM, Joel Nothman joel.noth...@gmail.com wrote:

 Thanks, I look forward to this being improved, while I have little
 availability to help myself atm.

 On 2 August 2015 at 22:58, Maheshakya Wijewardena pmaheshak...@gmail.com
 wrote:

  I agree with Joel. Profiling indicated that 69.8% of the total time of
  kneighbors is spent in _find_matching_indices and 22.9% in
  _compute_distances. So I'll give priority to working on
  _find_matching_indices with the method you suggested.

 On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena 
 pmaheshak...@gmail.com wrote:

 Hi Joel,
  I was on vacation during the past 3 days. I'll look into this asap and let
 you all know.

  I also did some profiling, but only with the usage of the `pairwise_distance`
  method. The brute-force technique uses it directly on the entire query array,
  but LSH uses it in a loop, and I noticed there is a huge lag. I'll first
  confirm your claims. I can start working on this, but I think I'll need
  reviewing from you or some other contributors as well. I'll do this if it's
  possible.

 On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman joel.noth...@gmail.com
 wrote:

 @Maheshakya, will you be able to do work in the near future on speeding
 up the ascending phase instead? Or should another contributor take that
 baton? Not only does it seem to be a major contributor to runtime, but it
 is independent of metric and hashing mechanism (within binary hashes), and
 hence the most fundamental component of LSHForest.

 On 30 July 2015 at 22:28, Joel Nothman joel.noth...@gmail.com wrote:

 (sorry, I should have said the first b layers, not 2**b layers,
 producing a memoization of 2**b offsets)

 On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote:

 One approach to fixing the ascending phase would be to ensure that
 _find_matching_indices is only searching over parts of the tree that have
 not yet been explored, while currently it searches over the entire index 
 at
 each depth.

 My preferred, but more experimental, solution is to memoize where the
 first 2**b layers of the tree begin and end in the index, for small b. So
 if our index stored:
 [[0, 0, 0, 1, 1, 0, 0, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0],
  [0, 1, 0, 1, 1, 0, 0, 0],
  [0, 1, 1, 0, 0, 1, 1, 0],
  [1, 0, 0, 0, 0, 1, 0, 1],
  [1, 0, 0, 1, 0, 1, 0, 1],
  [1, 1, 0, 0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1, 0, 0, 0]]
 and b=2, we'd memoize offsets for prefixes of size 2:
 [0, # 00
  3, # 01
  6, # 10
  8, # 11
 ]

 Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down
 to leave the first b bits [0, 1] remaining, and look them up in the array
 just defined to identify a much narrower search space [3, 6) matching 
 that
 prefix in the overall index.

 Indeed, given the min_hash_match constraint, not having this sort of
 thing for b = min_hash_match seems wasteful.

 This provides us O(1) access to the top layers of the tree when
 ascending, and makes the searchsorted calls run in log(n / (2 ** b)) time
 rather than log(n). It is also much more like traditional LSH. However, 
 it
 complexifies the code, as we now have to consider two strategies for
 descent/ascent.



 On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com
 wrote:

 What makes you think this is the main bottleneck? While it is not an
 insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible
 parameters that users might actually want, e.g. number of features for
 {dense, sparse} cases, index size, target 10NN precision and recall
 (selecting corresponding n_estimators and n_candidates). Ideally we'd
 consider real-world datasets. And of course, these should be sensible 
 for
 whichever metric we're operating over, and whether we're doing KNN or
 Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 Building LSHForest for 10 samples in 100 dimensions
 Done in 1.492s
 Average time for lshf neighbor queries: 0.005s
 Average time for exact neighbor queries: 0.002s
 Average Accuracy : 0.88
 Speed up: 0.5x

 Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
 

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-06 Thread Joel Nothman
It's nice to see some decent speed-up factors, though the accuracy tradeoff
is still not so great. Still, I'd like to see the code and where we can go
from here. Great work so far!

On 7 August 2015 at 07:50, Maheshakya Wijewardena pmaheshak...@gmail.com
wrote:

 I did a rough implementation, setting b = min_hash_match. The result I got
 from running the benchmark is attached. It was able to roughly triple the
 speed-up of the kneighbors function for large index sizes. However, this
 implementation adds some overhead to fitting time, as there are 2**b *
 n_estimators numpy searchsorted calls during training. But that will most
 probably be compensated for as the number of queries grows, since 2**b *
 n_estimators is a constant cost.

 I'll send a PR with proper refactoring.


 On Sun, Aug 2, 2015 at 6:41 PM, Joel Nothman joel.noth...@gmail.com
 wrote:

 Thanks, I look forward to this being improved, while I have little
 availability to help myself atm.

 On 2 August 2015 at 22:58, Maheshakya Wijewardena pmaheshak...@gmail.com
  wrote:

  I agree with Joel. Profiling indicated that 69.8% of the total time of
  kneighbors is spent in _find_matching_indices and 22.9% in
  _compute_distances. So I'll give priority to working on
  _find_matching_indices with the method you suggested.

 On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena 
 pmaheshak...@gmail.com wrote:

 Hi Joel,
  I was on vacation during the past 3 days. I'll look into this asap and let
 you all know.

  I also did some profiling, but only with the usage of the `pairwise_distance`
  method. The brute-force technique uses it directly on the entire query array,
  but LSH uses it in a loop, and I noticed there is a huge lag. I'll first
  confirm your claims. I can start working on this, but I think I'll need
  reviewing from you or some other contributors as well. I'll do this if it's
  possible.

 On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman joel.noth...@gmail.com
 wrote:

 @Maheshakya, will you be able to do work in the near future on
 speeding up the ascending phase instead? Or should another contributor 
 take
 that baton? Not only does it seem to be a major contributor to runtime, 
 but
 it is independent of metric and hashing mechanism (within binary hashes),
 and hence the most fundamental component of LSHForest.

 On 30 July 2015 at 22:28, Joel Nothman joel.noth...@gmail.com wrote:

 (sorry, I should have said the first b layers, not 2**b layers,
 producing a memoization of 2**b offsets)

 On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com
 wrote:

 One approach to fixing the ascending phase would be to ensure that
 _find_matching_indices is only searching over parts of the tree that 
 have
 not yet been explored, while currently it searches over the entire 
 index at
 each depth.

 My preferred, but more experimental, solution is to memoize where
 the first 2**b layers of the tree begin and end in the index, for small 
 b.
 So if our index stored:
 [[0, 0, 0, 1, 1, 0, 0, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0],
  [0, 1, 0, 1, 1, 0, 0, 0],
  [0, 1, 1, 0, 0, 1, 1, 0],
  [1, 0, 0, 0, 0, 1, 0, 1],
  [1, 0, 0, 1, 0, 1, 0, 1],
  [1, 1, 0, 0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1, 0, 0, 0]]
 and b=2, we'd memoize offsets for prefixes of size 2:
 [0, # 00
  3, # 01
  6, # 10
  8, # 11
 ]

 Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift
 down to leave the first b bits [0, 1] remaining, and look them up in the
 array just defined to identify a much narrower search space [3, 6) 
 matching
 that prefix in the overall index.

 Indeed, given the min_hash_match constraint, not having this sort of
 thing for b = min_hash_match seems wasteful.

 This provides us O(1) access to the top layers of the tree when
 ascending, and makes the searchsorted calls run in log(n / (2 ** b)) 
 time
 rather than log(n). It is also much more like traditional LSH. However, 
 it
 complexifies the code, as we now have to consider two strategies for
 descent/ascent.



 On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com
 wrote:

 What makes you think this is the main bottleneck? While it is not
 an insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible
 parameters that users might actually want, e.g. number of features for
 {dense, sparse} cases, index size, target 10NN precision and recall
 (selecting corresponding n_estimators and n_candidates). Ideally we'd
 consider real-world datasets. And of course, these should be sensible 
 for
 whichever metric we're operating over, and whether we're doing KNN or
 Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-02 Thread Joel Nothman
Thanks, I look forward to this being improved, while I have little
availability to help myself atm.

On 2 August 2015 at 22:58, Maheshakya Wijewardena pmaheshak...@gmail.com
wrote:

  I agree with Joel. Profiling indicated that 69.8% of the total time of
  kneighbors is spent in _find_matching_indices and 22.9% in
  _compute_distances. So I'll give priority to working on
  _find_matching_indices with the method you suggested.

 On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena 
 pmaheshak...@gmail.com wrote:

 Hi Joel,
  I was on vacation during the past 3 days. I'll look into this asap and let
 you all know.

  I also did some profiling, but only with the usage of the `pairwise_distance`
  method. The brute-force technique uses it directly on the entire query array,
  but LSH uses it in a loop, and I noticed there is a huge lag. I'll first
  confirm your claims. I can start working on this, but I think I'll need
  reviewing from you or some other contributors as well. I'll do this if it's
  possible.

 On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman joel.noth...@gmail.com
 wrote:

 @Maheshakya, will you be able to do work in the near future on speeding
 up the ascending phase instead? Or should another contributor take that
 baton? Not only does it seem to be a major contributor to runtime, but it
 is independent of metric and hashing mechanism (within binary hashes), and
 hence the most fundamental component of LSHForest.

 On 30 July 2015 at 22:28, Joel Nothman joel.noth...@gmail.com wrote:

 (sorry, I should have said the first b layers, not 2**b layers,
 producing a memoization of 2**b offsets)

 On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote:

 One approach to fixing the ascending phase would be to ensure that
 _find_matching_indices is only searching over parts of the tree that have
 not yet been explored, while currently it searches over the entire index 
 at
 each depth.

 My preferred, but more experimental, solution is to memoize where the
 first 2**b layers of the tree begin and end in the index, for small b. So
 if our index stored:
 [[0, 0, 0, 1, 1, 0, 0, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0],
  [0, 1, 0, 1, 1, 0, 0, 0],
  [0, 1, 1, 0, 0, 1, 1, 0],
  [1, 0, 0, 0, 0, 1, 0, 1],
  [1, 0, 0, 1, 0, 1, 0, 1],
  [1, 1, 0, 0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1, 0, 0, 0]]
 and b=2, we'd memoize offsets for prefixes of size 2:
 [0, # 00
  3, # 01
  6, # 10
  8, # 11
 ]

 Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down
 to leave the first b bits [0, 1] remaining, and look them up in the array
 just defined to identify a much narrower search space [3, 6) matching that
 prefix in the overall index.

 Indeed, given the min_hash_match constraint, not having this sort of
 thing for b = min_hash_match seems wasteful.

 This provides us O(1) access to the top layers of the tree when
 ascending, and makes the searchsorted calls run in log(n / (2 ** b)) time
 rather than log(n). It is also much more like traditional LSH. However, it
 complexifies the code, as we now have to consider two strategies for
 descent/ascent.



 On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote:

 What makes you think this is the main bottleneck? While it is not an
 insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible
 parameters that users might actually want, e.g. number of features for
 {dense, sparse} cases, index size, target 10NN precision and recall
 (selecting corresponding n_estimators and n_candidates). Ideally we'd
 consider real-world datasets. And of course, these should be sensible for
 whichever metric we're operating over, and whether we're doing KNN or
 Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 Building LSHForest for 10 samples in 100 dimensions
 Done in 1.492s
 Average time for lshf neighbor queries: 0.005s
 Average time for exact neighbor queries: 0.002s
 Average Accuracy : 0.88
 Speed up: 0.5x

 Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
 matrix, we have:

- 0.03 spent in _query (hashing and descending)
- 0.91 spent in _compute_distances (exact distance calculation)
- 3.80 remaining in _get_candidates (ascending phase), almost all
of which is spent in _find_matching_indices

 Cutting exact distance calculation to 0s will still not get this
 faster than the exact approach. Of course, your mileage may vary, but 
 this
 suggests to me you're barking up the wrong tree (no pun intended).

 On 30 July 2015 at 19:43, Maheshakya Wijewardena 
 pmaheshak...@gmail.com wrote:

 Hi,

 I've started to look into the matter of 

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-02 Thread Maheshakya Wijewardena
I agree with Joel. Profiling indicated that 69.8% of the total time of
kneighbors is spent in _find_matching_indices and 22.9% in
_compute_distances. So I'll give priority to working on
_find_matching_indices with the method you suggested.

On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena 
pmaheshak...@gmail.com wrote:

 Hi Joel,
  I was on vacation during the past 3 days. I'll look into this asap and let
 you all know.

  I also did some profiling, but only with the usage of the `pairwise_distance`
  method. The brute-force technique uses it directly on the entire query array,
  but LSH uses it in a loop, and I noticed there is a huge lag. I'll first
  confirm your claims. I can start working on this, but I think I'll need
  reviewing from you or some other contributors as well. I'll do this if it's
  possible.

 On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman joel.noth...@gmail.com
 wrote:

 @Maheshakya, will you be able to do work in the near future on speeding
 up the ascending phase instead? Or should another contributor take that
 baton? Not only does it seem to be a major contributor to runtime, but it
 is independent of metric and hashing mechanism (within binary hashes), and
 hence the most fundamental component of LSHForest.

 On 30 July 2015 at 22:28, Joel Nothman joel.noth...@gmail.com wrote:

 (sorry, I should have said the first b layers, not 2**b layers,
 producing a memoization of 2**b offsets)

 On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote:

 One approach to fixing the ascending phase would be to ensure that
 _find_matching_indices is only searching over parts of the tree that have
 not yet been explored, while currently it searches over the entire index at
 each depth.

 My preferred, but more experimental, solution is to memoize where the
 first 2**b layers of the tree begin and end in the index, for small b. So
 if our index stored:
 [[0, 0, 0, 1, 1, 0, 0, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0],
  [0, 1, 0, 1, 1, 0, 0, 0],
  [0, 1, 1, 0, 0, 1, 1, 0],
  [1, 0, 0, 0, 0, 1, 0, 1],
  [1, 0, 0, 1, 0, 1, 0, 1],
  [1, 1, 0, 0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1, 0, 0, 0]]
 and b=2, we'd memoize offsets for prefixes of size 2:
 [0, # 00
  3, # 01
  6, # 10
  8, # 11
 ]

 Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down
 to leave the first b bits [0, 1] remaining, and look them up in the array
 just defined to identify a much narrower search space [3, 6) matching that
 prefix in the overall index.

 Indeed, given the min_hash_match constraint, not having this sort of
 thing for b = min_hash_match seems wasteful.

 This provides us O(1) access to the top layers of the tree when
 ascending, and makes the searchsorted calls run in log(n / (2 ** b)) time
 rather than log(n). It is also much more like traditional LSH. However, it
 complexifies the code, as we now have to consider two strategies for
 descent/ascent.



 On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote:

 What makes you think this is the main bottleneck? While it is not an
 insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible
 parameters that users might actually want, e.g. number of features for
 {dense, sparse} cases, index size, target 10NN precision and recall
 (selecting corresponding n_estimators and n_candidates). Ideally we'd
 consider real-world datasets. And of course, these should be sensible for
 whichever metric we're operating over, and whether we're doing KNN or
 Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 Building LSHForest for 10 samples in 100 dimensions
 Done in 1.492s
 Average time for lshf neighbor queries: 0.005s
 Average time for exact neighbor queries: 0.002s
 Average Accuracy : 0.88
 Speed up: 0.5x

 Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
 matrix, we have:

- 0.03 spent in _query (hashing and descending)
- 0.91 spent in _compute_distances (exact distance calculation)
- 3.80 remaining in _get_candidates (ascending phase), almost all
of which is spent in _find_matching_indices

 Cutting exact distance calculation to 0s will still not get this
 faster than the exact approach. Of course, your mileage may vary, but this
 suggests to me you're barking up the wrong tree (no pun intended).

 On 30 July 2015 at 19:43, Maheshakya Wijewardena 
 pmaheshak...@gmail.com wrote:

 Hi,

  I've started to look into the matter of improving the performance of
  LSHForest. As we have discussed some time before (in fact, quite a long time
  ago), the main concern is to Cythonize distance calculations. Currently, this
  is done by 

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-01 Thread Maheshakya Wijewardena
Hi Joel,
I was on vacation during the past 3 days. I'll look into this asap and let you
all know.

I also did some profiling, but only with the usage of the `pairwise_distance`
method. The brute-force technique uses it directly on the entire query array,
but LSH uses it in a loop, and I noticed there is a huge lag. I'll first
confirm your claims. I can start working on this, but I think I'll need
reviewing from you or some other contributors as well. I'll do this if it's
possible.

On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman joel.noth...@gmail.com wrote:

 @Maheshakya, will you be able to do work in the near future on speeding up
 the ascending phase instead? Or should another contributor take that baton?
 Not only does it seem to be a major contributor to runtime, but it is
 independent of metric and hashing mechanism (within binary hashes), and
 hence the most fundamental component of LSHForest.

 On 30 July 2015 at 22:28, Joel Nothman joel.noth...@gmail.com wrote:

 (sorry, I should have said the first b layers, not 2**b layers, producing
 a memoization of 2**b offsets)

 On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote:

 One approach to fixing the ascending phase would be to ensure that
 _find_matching_indices is only searching over parts of the tree that have
 not yet been explored, while currently it searches over the entire index at
 each depth.

 My preferred, but more experimental, solution is to memoize where the
 first 2**b layers of the tree begin and end in the index, for small b. So
 if our index stored:
 [[0, 0, 0, 1, 1, 0, 0, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0],
  [0, 1, 0, 1, 1, 0, 0, 0],
  [0, 1, 1, 0, 0, 1, 1, 0],
  [1, 0, 0, 0, 0, 1, 0, 1],
  [1, 0, 0, 1, 0, 1, 0, 1],
  [1, 1, 0, 0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1, 0, 0, 0]]
 and b=2, we'd memoize offsets for prefixes of size 2:
 [0, # 00
  3, # 01
  6, # 10
  8, # 11
 ]

 Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to
 leave the first b bits [0, 1] remaining, and look them up in the array just
 defined to identify a much narrower search space [3, 6) matching that
 prefix in the overall index.

 Indeed, given the min_hash_match constraint, not having this sort of
 thing for b = min_hash_match seems wasteful.

 This provides us O(1) access to the top layers of the tree when
 ascending, and makes the searchsorted calls run in log(n / (2 ** b)) time
 rather than log(n). It is also much more like traditional LSH. However, it
 complexifies the code, as we now have to consider two strategies for
 descent/ascent.



 On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote:

 What makes you think this is the main bottleneck? While it is not an
 insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible
 parameters that users might actually want, e.g. number of features for
 {dense, sparse} cases, index size, target 10NN precision and recall
 (selecting corresponding n_estimators and n_candidates). Ideally we'd
 consider real-world datasets. And of course, these should be sensible for
 whichever metric we're operating over, and whether we're doing KNN or
 Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 Building LSHForest for 10 samples in 100 dimensions
 Done in 1.492s
 Average time for lshf neighbor queries: 0.005s
 Average time for exact neighbor queries: 0.002s
 Average Accuracy : 0.88
 Speed up: 0.5x

 Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
 matrix, we have:

- 0.03 spent in _query (hashing and descending)
- 0.91 spent in _compute_distances (exact distance calculation)
- 3.80 remaining in _get_candidates (ascending phase), almost all
of which is spent in _find_matching_indices

 Cutting exact distance calculation to 0s will still not get this faster
 than the exact approach. Of course, your mileage may vary, but this
 suggests to me you're barking up the wrong tree (no pun intended).

 On 30 July 2015 at 19:43, Maheshakya Wijewardena 
 pmaheshak...@gmail.com wrote:

 Hi,

  I've started to look into the matter of improving the performance of
  LSHForest. As we have discussed some time before (in fact, quite a long time
  ago), the main concern is to Cythonize distance calculations. Currently, this
  is done by iterating over all the query vectors when the `kneighbors` method
  is called for a set of query vectors. It has been identified that iterating
  over each query with Python loops is a huge overhead. I have implemented a
  few Cython hacks to demonstrate the distance calculation in LSHForest, and I
  was able to get an approximate speedup of 10x compared 

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-01 Thread Joel Nothman
@Maheshakya, will you be able to do work in the near future on speeding up
the ascending phase instead? Or should another contributor take that baton?
Not only does it seem to be a major contributor to runtime, but it is
independent of metric and hashing mechanism (within binary hashes), and
hence the most fundamental component of LSHForest.

On 30 July 2015 at 22:28, Joel Nothman joel.noth...@gmail.com wrote:

 (sorry, I should have said the first b layers, not 2**b layers, producing
 a memoization of 2**b offsets)

 On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote:

 One approach to fixing the ascending phase would be to ensure that
 _find_matching_indices is only searching over parts of the tree that have
 not yet been explored, while currently it searches over the entire index at
 each depth.

 My preferred, but more experimental, solution is to memoize where the
 first 2**b layers of the tree begin and end in the index, for small b. So
 if our index stored:
 [[0, 0, 0, 1, 1, 0, 0, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0],
  [0, 1, 0, 1, 1, 0, 0, 0],
  [0, 1, 1, 0, 0, 1, 1, 0],
  [1, 0, 0, 0, 0, 1, 0, 1],
  [1, 0, 0, 1, 0, 1, 0, 1],
  [1, 1, 0, 0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1, 0, 0, 0]]
 and b=2, we'd memoize offsets for prefixes of size 2:
 [0, # 00
  3, # 01
  6, # 10
  8, # 11
 ]

 Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to
 leave the first b bits [0, 1] remaining, and look them up in the array just
 defined to identify a much narrower search space [3, 6) matching that
 prefix in the overall index.

 Indeed, given the min_hash_match constraint, not having this sort of
 thing for b = min_hash_match seems wasteful.

 This provides us O(1) access to the top layers of the tree when
 ascending, and makes the searchsorted calls run in log(n / (2 ** b)) time
 rather than log(n). It is also much more like traditional LSH. However, it
 complexifies the code, as we now have to consider two strategies for
 descent/ascent.



 On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote:

 What makes you think this is the main bottleneck? While it is not an
 insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible
 parameters that users might actually want, e.g. number of features for
 {dense, sparse} cases, index size, target 10NN precision and recall
 (selecting corresponding n_estimators and n_candidates). Ideally we'd
 consider real-world datasets. And of course, these should be sensible for
 whichever metric we're operating over, and whether we're doing KNN or
 Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 Building LSHForest for 10 samples in 100 dimensions
 Done in 1.492s
 Average time for lshf neighbor queries: 0.005s
 Average time for exact neighbor queries: 0.002s
 Average Accuracy : 0.88
 Speed up: 0.5x

 Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
 matrix, we have:

- 0.03 spent in _query (hashing and descending)
- 0.91 spent in _compute_distances (exact distance calculation)
- 3.80 remaining in _get_candidates (ascending phase), almost all of
which is spent in _find_matching_indices

 Cutting exact distance calculation to 0s will still not get this faster
 than the exact approach. Of course, your mileage may vary, but this
 suggests to me you're barking up the wrong tree (no pun intended).

 On 30 July 2015 at 19:43, Maheshakya Wijewardena pmaheshak...@gmail.com
  wrote:

 Hi,

  I've started to look into the matter of improving the performance of
  LSHForest. As we have discussed some time before (in fact, quite a long time
  ago), the main concern is to Cythonize distance calculations. Currently, this
  is done by iterating over all the query vectors when the `kneighbors` method
  is called for a set of query vectors. It has been identified that iterating
  over each query with Python loops is a huge overhead. I have implemented a
  few Cython hacks to demonstrate the distance calculation in LSHForest, and I
  was able to get an approximate speedup of 10x compared to the current
  distance calculation with a Python loop. However, I came across some blockers
  while trying to do this and need some clarifications.

  What I need to know is: do we use a mechanism to release the GIL when we want
  to parallelize? One of my observations is that `pairwise_distance` uses all
  the cores even when I don't specify the `n_jobs` parameter, which is 1 by
  default. Is this expected behavior?

  If I want to release the GIL, can I use the OpenMP module in Cython? Or is
  that a task for joblib?
  Any input on this is highly appreciated.

 Best regards,
 --


Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Gael Varoquaux
Hi,

This is absolutely great news. Thanks a lot. Please do open a WIP PR. We
(at INRIA) were planning to allocate time from someone this summer to
work on this. So you'll have someone reviewing / advising.

With regards to releasing the GIL, you need to use the 'with nogil'
statement in Cython.

Have a look for instance at page 203 of the Cython book:
https://books.google.fr/books?id=H3JmBgAAQBAJpg=PA203lpg=PA203dq=with+nogil+statementsource=blots=QH3Nh0uhsTsig=-F-q8NDOHkJx_lyXlJSnektZjushl=ensa=Xved=0CEYQ6AEwBWoVChMI1Iigj-OCxwIVAlkUCh0MFQOQ#v=onepageq=with%20nogil%20statementf=false

The parallel computing is then done with joblib Parallel, using the
threading backend.
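
For the dispatch side, a minimal sketch using the joblib threading backend
(the per-query function below is only a NumPy stand-in for a Cython kernel
that would do its work under `with nogil`; names and sizes are illustrative):

import numpy as np
from joblib import Parallel, delayed  # sklearn.externals.joblib at the time
from sklearn.metrics import pairwise_distances

def query_one(X_index, q):
    # Stand-in for a GIL-releasing Cython routine.
    return pairwise_distances(q.reshape(1, -1), X_index).argmin()

rng = np.random.RandomState(0)
X_index = rng.rand(2000, 100)
queries = rng.rand(50, 100)

# The threading backend avoids copying data between workers, but it only
# pays off if the inner routine actually releases the GIL.
nearest = Parallel(n_jobs=4, backend='threading')(
    delayed(query_one)(X_index, q) for q in queries)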

G



On Thu, Jul 30, 2015 at 03:13:30PM +0530, Maheshakya Wijewardena wrote:
 Hi,

 I've started to look into the matter of improving the performance of LSHForest.
 As we have discussed some time before (in fact, quite a long time ago), the
 main concern is to Cythonize distance calculations. Currently, this is done by
 iterating over all the query vectors when the `kneighbors` method is called for
 a set of query vectors. It has been identified that iterating over each query
 with Python loops is a huge overhead. I have implemented a few Cython hacks to
 demonstrate the distance calculation in LSHForest, and I was able to get an
 approximate speedup of 10x compared to the current distance calculation with a
 Python loop. However, I came across some blockers while trying to do this and
 need some clarifications.

 What I need to know is: do we use a mechanism to release the GIL when we want
 to parallelize? One of my observations is that `pairwise_distance` uses all the
 cores even when I don't specify the `n_jobs` parameter, which is 1 by default.
 Is this expected behavior?

 If I want to release the GIL, can I use the OpenMP module in Cython? Or is that
 a task for joblib?

 Best regards,
-- 
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux



Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
What makes you think this is the main bottleneck? While it is not an
insignificant consumer of time, I really doubt this is what's making
scikit-learn's LSH implementation severely underperform with respect to
other implementations.

We need to profile. In order to do that, we need some sensible parameters
that users might actually want, e.g. number of features for {dense, sparse}
cases, index size, target 10NN precision and recall (selecting
corresponding n_estimators and n_candidates). Ideally we'd consider
real-world datasets. And of course, these should be sensible for whichever
metric we're operating over, and whether we're doing KNN or Radius searches.

I don't know if it's realistic, but I've profiled the following
bench_plot_approximate_neighbors settings:

Building NearestNeighbors for 10 samples in 100 dimensions
LSHF parameters: n_estimators = 15, n_candidates = 100
Building LSHForest for 10 samples in 100 dimensions
Done in 1.492s
Average time for lshf neighbor queries: 0.005s
Average time for exact neighbor queries: 0.002s
Average Accuracy : 0.88
Speed up: 0.5x

Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query matrix,
we have:

   - 0.03 spent in _query (hashing and descending)
   - 0.91 spent in _compute_distances (exact distance calculation)
   - 3.80 remaining in _get_candidates (ascending phase), almost all of
   which is spent in _find_matching_indices

Cutting exact distance calculation to 0s will still not get this faster
than the exact approach. Of course, your mileage may vary, but this
suggests to me you're barking up the wrong tree (no pun intended).
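
For reference, this kind of breakdown can be reproduced with cProfile along the
following lines (the data sizes here are placeholders rather than the exact
benchmark settings, so the numbers will differ):

import cProfile
import pstats
import numpy as np
from sklearn.neighbors import LSHForest

rng = np.random.RandomState(42)
X_index = rng.rand(10000, 100)
X_query = rng.rand(1000, 100)

lshf = LSHForest(n_estimators=15, n_candidates=100, random_state=42).fit(X_index)

profiler = cProfile.Profile()
profiler.enable()
lshf.kneighbors(X_query, n_neighbors=10)
profiler.disable()

# See which helpers (_query, _compute_distances, _find_matching_indices, ...)
# dominate the cumulative time.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)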

On 30 July 2015 at 19:43, Maheshakya Wijewardena pmaheshak...@gmail.com
wrote:

 Hi,

  I've started to look into the matter of improving the performance of
  LSHForest. As we have discussed some time before (in fact, quite a long time
  ago), the main concern is to Cythonize distance calculations. Currently, this
  is done by iterating over all the query vectors when the `kneighbors` method
  is called for a set of query vectors. It has been identified that iterating
  over each query with Python loops is a huge overhead. I have implemented a
  few Cython hacks to demonstrate the distance calculation in LSHForest, and I
  was able to get an approximate speedup of 10x compared to the current
  distance calculation with a Python loop. However, I came across some blockers
  while trying to do this and need some clarifications.

  What I need to know is: do we use a mechanism to release the GIL when we want
  to parallelize? One of my observations is that `pairwise_distance` uses all
  the cores even when I don't specify the `n_jobs` parameter, which is 1 by
  default. Is this expected behavior?

  If I want to release the GIL, can I use the OpenMP module in Cython? Or is
  that a task for joblib?
  Any input on this is highly appreciated.

 Best regards,
 --

 *Maheshakya Wijewardena,Undergraduate,*
 *Department of Computer Science and Engineering,*
 *Faculty of Engineering.*
 *University of Moratuwa,*
 *Sri Lanka*




Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
One approach to fixing the ascending phase would be to ensure that
_find_matching_indices is only searching over parts of the tree that have
not yet been explored, while currently it searches over the entire index at
each depth.

My preferred, but more experimental, solution is to memoize where the first
2**b layers of the tree begin and end in the index, for small b. So if our
index stored:
[[0, 0, 0, 1, 1, 0, 0, 0],
 [0, 0, 1, 0, 1, 0, 1, 0],
 [0, 0, 1, 0, 1, 0, 1, 0],
 [0, 1, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 1, 1, 0, 0, 0],
 [0, 1, 1, 0, 0, 1, 1, 0],
 [1, 0, 0, 0, 0, 1, 0, 1],
 [1, 0, 0, 1, 0, 1, 0, 1],
 [1, 1, 0, 0, 0, 0, 0, 0],
 [1, 1, 1, 1, 1, 0, 0, 0]]
and b=2, we'd memoize offsets for prefixes of size 2:
[0, # 00
 3, # 01
 6, # 10
 8, # 11
]

Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to
leave the first b bits [0, 1] remaining, and look them up in the array just
defined to identify a much narrower search space [3, 6) matching that
prefix in the overall index.

Indeed, given the min_hash_match constraint, not having this sort of thing
for b = min_hash_match seems wasteful.

This provides us O(1) access to the top layers of the tree when ascending,
and makes the searchsorted calls run in log(n / (2 ** b)) time rather than
log(n). It is also much more like traditional LSH. However, it complexifies
the code, as we now have to consider two strategies for descent/ascent.
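
To make that concrete, a small self-contained sketch of the idea on the example
above (plain NumPy; the names are mine, not LSHForest internals):

import numpy as np

# The sorted binary hashes from the example above, one row per indexed point.
index = np.array([[0, 0, 0, 1, 1, 0, 0, 0],
                  [0, 0, 1, 0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0, 1, 0],
                  [0, 1, 0, 0, 0, 0, 0, 0],
                  [0, 1, 0, 1, 1, 0, 0, 0],
                  [0, 1, 1, 0, 0, 1, 1, 0],
                  [1, 0, 0, 0, 0, 1, 0, 1],
                  [1, 0, 0, 1, 0, 1, 0, 1],
                  [1, 1, 0, 0, 0, 0, 0, 0],
                  [1, 1, 1, 1, 1, 0, 0, 0]])
b = 2

def pack_prefix(bits, b):
    # Interpret the first b columns as a big-endian integer.
    weights = 1 << np.arange(b - 1, -1, -1)
    return bits[:, :b] @ weights

prefixes = pack_prefix(index, b)                        # [0 0 0 1 1 1 2 2 3 3]
offsets = np.searchsorted(prefixes, np.arange(2 ** b))  # [0 3 6 8], memoized once

query = np.array([0, 1, 1, 0, 0, 0, 0, 0])
p = int(pack_prefix(query[None, :], b)[0])              # prefix 01 -> 1
start = offsets[p]
stop = offsets[p + 1] if p + 1 < 2 ** b else len(index)
print(start, stop)                                      # 3 6: search only rows [3, 6)

Everything below depth b then only ever touches the [start, stop) slice.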



On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote:

 What makes you think this is the main bottleneck? While it is not an
 insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible parameters
 that users might actually want, e.g. number of features for {dense, sparse}
 cases, index size, target 10NN precision and recall (selecting
 corresponding n_estimators and n_candidates). Ideally we'd consider
 real-world datasets. And of course, these should be sensible for whichever
 metric we're operating over, and whether we're doing KNN or Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 Building LSHForest for 10 samples in 100 dimensions
 Done in 1.492s
 Average time for lshf neighbor queries: 0.005s
 Average time for exact neighbor queries: 0.002s
 Average Accuracy : 0.88
 Speed up: 0.5x

 Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query matrix,
 we have:

- 0.03 spent in _query (hashing and descending)
- 0.91 spent in _compute_distances (exact distance calculation)
- 3.80 remaining in _get_candidates (ascending phase), almost all of
which is spent in _find_matching_indices

 Cutting exact distance calculation to 0s will still not get this faster
 than the exact approach. Of course, your mileage may vary, but this
 suggests to me you're barking up the wrong tree (no pun intended).

 On 30 July 2015 at 19:43, Maheshakya Wijewardena pmaheshak...@gmail.com
 wrote:

 Hi,

  I've started to look into the matter of improving the performance of
  LSHForest. As we have discussed some time before (in fact, quite a long time
  ago), the main concern is to Cythonize distance calculations. Currently, this
  is done by iterating over all the query vectors when the `kneighbors` method
  is called for a set of query vectors. It has been identified that iterating
  over each query with Python loops is a huge overhead. I have implemented a
  few Cython hacks to demonstrate the distance calculation in LSHForest, and I
  was able to get an approximate speedup of 10x compared to the current
  distance calculation with a Python loop. However, I came across some blockers
  while trying to do this and need some clarifications.

  What I need to know is: do we use a mechanism to release the GIL when we want
  to parallelize? One of my observations is that `pairwise_distance` uses all
  the cores even when I don't specify the `n_jobs` parameter, which is 1 by
  default. Is this expected behavior?

  If I want to release the GIL, can I use the OpenMP module in Cython? Or is
  that a task for joblib?
  Any input on this is highly appreciated.

 Best regards,
 --

 *Maheshakya Wijewardena,Undergraduate,*
 *Department of Computer Science and Engineering,*
 *Faculty of Engineering.*
 *University of Moratuwa,*
 *Sri Lanka*



Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
(sorry, I should have said the first b layers, not 2**b layers, producing a
memoization of 2**b offsets)

On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote:

 One approach to fixing the ascending phase would be to ensure that
 _find_matching_indices is only searching over parts of the tree that have
 not yet been explored, while currently it searches over the entire index at
 each depth.

 My preferred, but more experimental, solution is to memoize where the
 first 2**b layers of the tree begin and end in the index, for small b. So
 if our index stored:
 [[0, 0, 0, 1, 1, 0, 0, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0],
  [0, 1, 0, 1, 1, 0, 0, 0],
  [0, 1, 1, 0, 0, 1, 1, 0],
  [1, 0, 0, 0, 0, 1, 0, 1],
  [1, 0, 0, 1, 0, 1, 0, 1],
  [1, 1, 0, 0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1, 0, 0, 0]]
 and b=2, we'd memoize offsets for prefixes of size 2:
 [0, # 00
  3, # 01
  6, # 10
  8, # 11
 ]

 Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to
 leave the first b bits [0, 1] remaining, and look them up in the array just
 defined to identify a much narrower search space [3, 6) matching that
 prefix in the overall index.

 Indeed, given the min_hash_match constraint, not having this sort of thing
 for b = min_hash_match seems wasteful.

 This provides us O(1) access to the top layers of the tree when ascending,
 and makes the searchsorted calls run in log(n / (2 ** b)) time rather than
 log(n). It is also much more like traditional LSH. However, it complexifies
 the code, as we now have to consider two strategies for descent/ascent.



 On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote:

 What makes you think this is the main bottleneck? While it is not an
 insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible parameters
 that users might actually want, e.g. number of features for {dense, sparse}
 cases, index size, target 10NN precision and recall (selecting
 corresponding n_estimators and n_candidates). Ideally we'd consider
 real-world datasets. And of course, these should be sensible for whichever
 metric we're operating over, and whether we're doing KNN or Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 Building LSHForest for 10 samples in 100 dimensions
 Done in 1.492s
 Average time for lshf neighbor queries: 0.005s
 Average time for exact neighbor queries: 0.002s
 Average Accuracy : 0.88
 Speed up: 0.5x

 Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
 matrix, we have:

- 0.03 spent in _query (hashing and descending)
- 0.91 spent in _compute_distances (exact distance calculation)
- 3.80 remaining in _get_candidates (ascending phase), almost all of
which is spent in _find_matching_indices

 Cutting exact distance calculation to 0s will still not get this faster
 than the exact approach. Of course, your mileage may vary, but this
 suggests to me you're barking up the wrong tree (no pun intended).

 On 30 July 2015 at 19:43, Maheshakya Wijewardena pmaheshak...@gmail.com
 wrote:

 Hi,

  I've started to look into the matter of improving the performance of
  LSHForest. As we have discussed some time before (in fact, quite a long time
  ago), the main concern is to Cythonize distance calculations. Currently, this
  is done by iterating over all the query vectors when the `kneighbors` method
  is called for a set of query vectors. It has been identified that iterating
  over each query with Python loops is a huge overhead. I have implemented a
  few Cython hacks to demonstrate the distance calculation in LSHForest, and I
  was able to get an approximate speedup of 10x compared to the current
  distance calculation with a Python loop. However, I came across some blockers
  while trying to do this and need some clarifications.

  What I need to know is: do we use a mechanism to release the GIL when we want
  to parallelize? One of my observations is that `pairwise_distance` uses all
  the cores even when I don't specify the `n_jobs` parameter, which is 1 by
  default. Is this expected behavior?

  If I want to release the GIL, can I use the OpenMP module in Cython? Or is
  that a task for joblib?
  Any input on this is highly appreciated.

 Best regards,
 --

 *Maheshakya Wijewardena,Undergraduate,*
 *Department of Computer Science and Engineering,*
 *Faculty of Engineering.*
 *University of Moratuwa,*
 *Sri Lanka*



[Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Maheshakya Wijewardena
Hi,

I've started to look into the matter of improving the performance of LSHForest.
As we have discussed some time before (in fact, quite a long time ago), the
main concern is to Cythonize distance calculations. Currently, this is done by
iterating over all the query vectors when the `kneighbors` method is called for
a set of query vectors. It has been identified that iterating over each query
with Python loops is a huge overhead. I have implemented a few Cython hacks to
demonstrate the distance calculation in LSHForest, and I was able to get an
approximate speedup of 10x compared to the current distance calculation with a
Python loop. However, I came across some blockers while trying to do this and
need some clarifications.
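
To make the contrast concrete, an illustrative sketch only (in LSHForest each
query ends up with its own candidate set, which is why the loop exists; the
fixed candidate slice below is just a stand-in):

import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X_index = rng.rand(5000, 100)
X_query = rng.rand(200, 100)

# Brute-force kneighbors: a single vectorised call over the whole query array.
D = pairwise_distances(X_query, X_index)

# Per-query Python loop, as in the current LSHForest code path.
candidates = np.arange(500)  # stand-in for the LSH candidate set of one query
D_loop = [pairwise_distances(q.reshape(1, -1), X_index[candidates])
          for q in X_query]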

What I need to know is: do we use a mechanism to release the GIL when we want
to parallelize? One of my observations is that `pairwise_distance` uses all the
cores even when I don't specify the `n_jobs` parameter, which is 1 by default.
Is this expected behavior?

If I want to release the GIL, can I use the OpenMP module in Cython? Or is that
a task for joblib?
Any input on this is highly appreciated.

Best regards,
-- 

*Maheshakya Wijewardena,Undergraduate,*
*Department of Computer Science and Engineering,*
*Faculty of Engineering.*
*University of Moratuwa,*
*Sri Lanka*