Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-01 Thread Maheshakya Wijewardena
Hi Joel,
I was on vacation for the past 3 days. I'll look into this asap and let you
all know.

I also did some profiling, but only of the usage of the `pairwise_distances`
method. The brute force technique uses it directly on the entire query
array, whereas LSH calls it in a loop, and I noticed there is a huge lag.
I'll first confirm your claims. I can start working on this, but I think
I'll need reviews from you or some other contributors as well. I'll do this
if it's possible.
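
For reference, a minimal sketch of the contrast I mean (the shapes are made
up and this is not the actual LSHForest code, just an illustration):

    import numpy as np
    from sklearn.metrics import pairwise_distances

    X = np.random.rand(10000, 100)        # indexed points (assumed shape)
    queries = np.random.rand(1000, 100)   # query points (assumed shape)

    # Brute force style: one vectorized call over the whole query array.
    dists_batch = pairwise_distances(queries, X)

    # LSH style: one call per query inside a Python loop, which adds
    # per-call overhead and is where I noticed the lag.
    dists_loop = np.vstack([pairwise_distances(q.reshape(1, -1), X)
                            for q in queries])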

On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman joel.noth...@gmail.com wrote:

 @Maheshakya, will you be able to do work in the near future on speeding up
 the ascending phase instead? Or should another contributor take that baton?
 Not only does it seem to be a major contributor to runtime, but it is
 independent of metric and hashing mechanism (within binary hashes), and
 hence the most fundamental component of LSHForest.

 On 30 July 2015 at 22:28, Joel Nothman joel.noth...@gmail.com wrote:

 (sorry, I should have said the first b layers, not 2**b layers, producing
 a memoization of 2**b offsets)

 On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote:

 One approach to fixing the ascending phase would be to ensure that
 _find_matching_indices only searches over parts of the tree that have not
 yet been explored; currently it searches over the entire index at each
 depth.

 My preferred, but more experimental, solution is to memoize where the
 first 2**b layers of the tree begin and end in the index, for small b. So
 if our index stored:
 [[0, 0, 0, 1, 1, 0, 0, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0],
  [0, 1, 0, 1, 1, 0, 0, 0],
  [0, 1, 1, 0, 0, 1, 1, 0],
  [1, 0, 0, 0, 0, 1, 0, 1],
  [1, 0, 0, 1, 0, 1, 0, 1],
  [1, 1, 0, 0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1, 0, 0, 0]]
 and b=2, we'd memoize offsets for prefixes of size 2:
 [0, # 00
  3, # 01
  6, # 10
  8, # 11
 ]

 Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to
 leave the first b bits [0, 1] remaining, and look them up in the array just
 defined to identify a much narrower search space [3, 6) matching that
 prefix in the overall index.
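
 (For concreteness, a rough numpy sketch of that prefix lookup, assuming the
 index rows are packed into sorted integers; the variable names are
 illustrative only, not the LSHForest internals:)

     import numpy as np

     bits = np.array([[0, 0, 0, 1, 1, 0, 0, 0],
                      [0, 0, 1, 0, 1, 0, 1, 0],
                      [0, 0, 1, 0, 1, 0, 1, 0],
                      [0, 1, 0, 0, 0, 0, 0, 0],
                      [0, 1, 0, 1, 1, 0, 0, 0],
                      [0, 1, 1, 0, 0, 1, 1, 0],
                      [1, 0, 0, 0, 0, 1, 0, 1],
                      [1, 0, 0, 1, 0, 1, 0, 1],
                      [1, 1, 0, 0, 0, 0, 0, 0],
                      [1, 1, 1, 1, 1, 0, 0, 0]])
     index = bits.dot(1 << np.arange(7, -1, -1))   # pack rows into sorted ints

     b = 2
     # memoized start offset of each b-bit prefix: [0, 3, 6, 8]
     offsets = np.searchsorted(index >> (8 - b), np.arange(2 ** b))

     query = np.array([0, 1, 1, 0, 0, 0, 0, 0]).dot(1 << np.arange(7, -1, -1))
     p = query >> (8 - b)                          # prefix 01 -> 1
     lo = offsets[p]
     hi = offsets[p + 1] if p + 1 < 2 ** b else len(index)
     # later searchsorted calls only need index[lo:hi], i.e. the slice [3, 6)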

 Indeed, given the min_hash_match constraint, not having this sort of
 thing for b = min_hash_match seems wasteful.

 This provides us O(1) access to the top layers of the tree when
 ascending, and makes the searchsorted calls run in log(n / (2 ** b)) time
 rather than log(n). It is also much more like traditional LSH. However, it
 complexifies the code, as we now have to consider two strategies for
 descent/ascent.



 On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote:

 What makes you think this is the main bottleneck? While it is not an
 insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible
 parameters that users might actually want, e.g. number of features for
 {dense, sparse} cases, index size, target 10NN precision and recall
 (selecting corresponding n_estimators and n_candidates). Ideally we'd
 consider real-world datasets. And of course, these should be sensible for
 whichever metric we're operating over, and whether we're doing KNN or
 Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 Building LSHForest for 10 samples in 100 dimensions
 Done in 1.492s
 Average time for lshf neighbor queries: 0.005s
 Average time for exact neighbor queries: 0.002s
 Average Accuracy : 0.88
 Speed up: 0.5x

 Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
 matrix, we have:

- 0.03 spent in _query (hashing and descending)
- 0.91 spent in _compute_distances (exact distance calculation)
- 3.80 remaining in _get_candidates (ascending phase), almost all
of which is spent in _find_matching_indices

 Cutting exact distance calculation to 0s will still not get this faster
 than the exact approach. Of course, your mileage may vary, but this
 suggests to me you're barking up the wrong tree (no pun intended).

 On 30 July 2015 at 19:43, Maheshakya Wijewardena 
 pmaheshak...@gmail.com wrote:

 Hi,

 I've started to look into the matter of improving performance of
 LSHForest. As we discussed some time ago (in fact, quite a long time ago),
 the main concern is to Cythonize distance calculations. Currently, this is
 done by iterating over all the query vectors when the `kneighbors` method
 is called for a set of query vectors. It has been identified that iterating
 over each query with Python loops is a huge overhead. I have implemented a
 few Cython hacks to demonstrate the distance calculation in LSHForest and I
 was able to get an approximate 10x speedup compared to the current distance
 calculation with a Python loop.

Re: [Scikit-learn-general] About contributing code

2015-08-01 Thread Prokopis Gryllos
Hi Andreas,

Thank you for your reply!

The NFL belongs to the k-NN family of algorithms for classification /
regression.
In general, for a query point, NFL works like k-NN, but instead of using
the k feature points to determine the result, it generalizes any two
feature points of the same class into a feature line and then computes the
distance of the query point from its projection on the feature lines.
It is supposed to improve the results of NN in some cases, especially on
pattern recognition tasks.
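
For what it's worth, the projection step can be sketched in a few lines of
numpy (just an illustration of the idea, not a proposed implementation):

    import numpy as np

    def feature_line_distance(q, x1, x2):
        # Distance from query point q to its projection on the feature line
        # through the feature points x1 and x2 (all 1-D arrays).
        d = x2 - x1
        t = np.dot(q - x1, d) / np.dot(d, d)
        projection = x1 + t * d
        return np.linalg.norm(q - projection)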

You can have a look here
http://www.scholarpedia.org/article/Nearest_feature_line
http://www.dsp.toronto.edu/juwei/Publication/IEEE_NN.pdf

I recently learnt about it while studying for a feature extraction course.
I haven't implemented / tested it yet and I am new to scikit-learn, but I
thought it would be fun (from my side) to implement it and have the
opportunity to contribute too. I know that this is not how things work for
you: to add a feature to the library, it must be something that is going to
be useful to many people so that there is a reason for maintaining it.

Having a look at the issue page on GitHub, I think I can help with some
minor contributions like adding more related projects to the documentation
(I will see if I can do that asap). Besides that, my idea is to implement
something from scratch so that I can get familiar with the project step by
step, and if I end up with something that you want to add to the library I
will be very happy to contribute.

Thank you for your time. I will continue watching the issue page and maybe
help with something.

Best Regards,
Prokopis

On Tue, Jul 28, 2015 at 8:43 PM, Andreas Mueller t3k...@gmail.com wrote:

 Hi Gryllos.

 Before contributing a new feature (which is usually a major undertaking),
 it is usually a good idea to get started by working on known issues; have
 a look at the issue tracker.

 I'm not familiar with the feature line approach. Can you elaborate and
 provide a reference?
 Please see the FAQ for our policy on algorithms we like to include:


 http://scikit-learn.org/dev/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published


 http://scikit-learn.org/dev/faq.html#can-i-add-this-classical-algorithm-from-the-80s

 Cheers,
 Andy


 On 07/23/2015 06:54 PM, Prokopis Gryllos wrote:

 Hi everyone,

 I would like to contribute code to the project and I was thinking of
 implementing
 a nearest feature line approach to the nearest neighbor class.
 As is suggested in the contributing guidelines, I thought it would be best
 to ask you first before I start working on it.

 Thank you in advance, I am waiting for your reply!

 Best Regards,
 Gryllos Prokopis




Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-01 Thread Joel Nothman
@Maheshakya, will you be able to do work in the near future on speeding up
the ascending phase instead? Or should another contributor take that baton?
Not only does it seem to be a major contributor to runtime, but it is
independent of metric and hashing mechanism (within binary hashes), and
hence the most fundamental component of LSHForest.

On 30 July 2015 at 22:28, Joel Nothman joel.noth...@gmail.com wrote:

 (sorry, I should have said the first b layers, not 2**b layers, producing
 a memoization of 2**b offsets)

 On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote:

 One approach to fixing the ascending phase would be to ensure that
 _find_matching_indices only searches over parts of the tree that have not
 yet been explored; currently it searches over the entire index at each
 depth.

 My preferred, but more experimental, solution is to memoize where the
 first 2**b layers of the tree begin and end in the index, for small b. So
 if our index stored:
 [[0, 0, 0, 1, 1, 0, 0, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0],
  [0, 1, 0, 1, 1, 0, 0, 0],
  [0, 1, 1, 0, 0, 1, 1, 0],
  [1, 0, 0, 0, 0, 1, 0, 1],
  [1, 0, 0, 1, 0, 1, 0, 1],
  [1, 1, 0, 0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1, 0, 0, 0]]
 and b=2, we'd memoize offsets for prefixes of size 2:
 [0, # 00
  3, # 01
  6, # 10
  8, # 11
 ]

 Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to
 leave the first b bits [0, 1] remaining, and look them up in the array just
 defined to identify a much narrower search space [3, 6) matching that
 prefix in the overall index.

 Indeed, given the min_hash_match constraint, not having this sort of
 thing for b = min_hash_match seems wasteful.

 This provides us O(1) access to the top layers of the tree when
 ascending, and makes the searchsorted calls run in log(n / (2 ** b)) time
 rather than log(n). It is also much more like traditional LSH. However, it
 complexifies the code, as we now have to consider two strategies for
 descent/ascent.



 On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote:

 What makes you think this is the main bottleneck? While it is not an
 insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible
 parameters that users might actually want, e.g. number of features for
 {dense, sparse} cases, index size, target 10NN precision and recall
 (selecting corresponding n_estimators and n_candidates). Ideally we'd
 consider real-world datasets. And of course, these should be sensible for
 whichever metric we're operating over, and whether we're doing KNN or
 Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 Building LSHForest for 10 samples in 100 dimensions
 Done in 1.492s
 Average time for lshf neighbor queries: 0.005s
 Average time for exact neighbor queries: 0.002s
 Average Accuracy : 0.88
 Speed up: 0.5x

 Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
 matrix, we have:

- 0.03 spent in _query (hashing and descending)
- 0.91 spent in _compute_distances (exact distance calculation)
- 3.80 remaining in _get_candidates (ascending phase), almost all of
which is spent in _find_matching_indices

 Cutting exact distance calculation to 0s will still not get this faster
 than the exact approach. Of course, your mileage may vary, but this
 suggests to me you're barking up the wrong tree (no pun intended).

 On 30 July 2015 at 19:43, Maheshakya Wijewardena pmaheshak...@gmail.com
  wrote:

 Hi,

 I've started to look into the matter of improving performance of
 LSHForest. As we discussed some time ago (in fact, quite a long time ago),
 the main concern is to Cythonize distance calculations. Currently, this is
 done by iterating over all the query vectors when the `kneighbors` method
 is called for a set of query vectors. It has been identified that
 iterating over each query with Python loops is a huge overhead. I have
 implemented a few Cython hacks to demonstrate the distance calculation in
 LSHForest and I was able to get an approximate 10x speedup compared to the
 current distance calculation with a Python loop. However, I came across
 some blockers while trying to do this and need some clarifications.

 What I need to know is: do we use a mechanism to release the GIL when we
 want to parallelize? One of my observations is that `pairwise_distances`
 uses all the cores even when I don't specify the `n_jobs` parameter, which
 is 1 by default. Is this expected behavior?

 If I want to release the GIL, can I use the OpenMP module in Cython? Or is
 that a task for joblib?
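
 (For illustration only, a hypothetical joblib-based split of the query
 array into chunks; the chunking, the n_jobs value and the direct call to
 pairwise_distances are assumptions for the sketch, not LSHForest
 internals:)

     import numpy as np
     from joblib import Parallel, delayed
     from sklearn.metrics import pairwise_distances

     def chunked_distances(X, queries, n_jobs=2):
         # Split the query array into one chunk per worker and compute the
         # distances for each chunk in a separate joblib worker.
         chunks = np.array_split(queries, n_jobs)
         results = Parallel(n_jobs=n_jobs)(
             delayed(pairwise_distances)(chunk, X) for chunk in chunks)
         return np.vstack(results)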
 Any input on this is highly appreciated.

 Best regards,
 --