Re: [Scikit-learn-general] Performance of LSHForest

2015-04-20 Thread Daniel Vainsencher
On 04/19/2015 08:18 AM, Joel Nothman wrote:


 On 17 April 2015 at 13:52, Daniel Vainsencher
 daniel.vainsenc...@gmail.com wrote:

 On 04/16/2015 05:49 PM, Joel Nothman wrote:
  I more or less agree. Certainly we only need to do one searchsorted per
  query per tree, and then do linear scans. There is a question of how
  close we stay to the original LSHForest algorithm, which relies on
  matching prefixes rather than hamming distance. Hamming distance is
  easier to calculate in NumPy and is probably faster to calculate in C
  too (with or without using POPCNT). Perhaps the only advantage of using
  Cython in your solution is to avoid the memory overhead of unpackbits.
 You obviously know more than I do about Cython vs numpy options.

  However, n_candidates before and after is arguably not sufficient if one
  side has more than n_candidates with a high prefix overlap.
 I disagree. Being able to look at 2*n_candidates that must contain
 n_candidates of the closest ones, rather than as many as happen to
 agree on x number of bits is a feature, not a bug. Especially if we
 then filter them by hamming distance.


 But it need not contain the closest ones that would have been retrieved
 by LSHForest (assuming we're only looking at a single tree). Let's say
 n_candidates is 1, our query is 11 and our index contains

 A. 10 agreed = 1
 B. 110011 agreed = 3
 C. 110100 agreed = 5

 A binary search will find A-B. The n-candidates x 2 window includes A
 and B. C is closer and has a longer prefix overlap with the query than A
 does. My understanding of LSHForest is that its ascent by prefix length
 would necessarily find C. Your approach would not.
Agreed!


 While that may be a feature of your approach, I think we have reason to
 prefer a published algorithm.
Ok, then I guess this is not the place for this idea.






--
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-event?utm_source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Performance of LSHForest

2015-04-19 Thread Joel Nothman
On 17 April 2015 at 13:52, Daniel Vainsencher daniel.vainsenc...@gmail.com
wrote:

 On 04/16/2015 05:49 PM, Joel Nothman wrote:
  I more or less agree. Certainly we only need to do one searchsorted per
  query per tree, and then do linear scans. There is a question of how
  close we stay to the original LSHForest algorithm, which relies on
  matching prefixes rather than hamming distance. Hamming distance is
  easier to calculate in NumPy and is probably faster to calculate in C
  too (with or without using POPCNT). Perhaps the only advantage of using
  Cython in your solution is to avoid the memory overhead of unpackbits.
 You obviously know more than I do about Cython vs numpy options.

  However, n_candidates before and after is arguably not sufficient if one
  side has more than n_candidates with a high prefix overlap.
 I disagree. Being able to look at 2*n_candidates that must contain
 n_candidates of the closest ones, rather than as many as happen to
 agree on x number of bits is a feature, not a bug. Especially if we
 then filter them by hamming distance.


But it need not contain the closest ones that would have been retrieved by
LSHForest (assuming we're only looking at a single tree). Let's say
n_candidates is 1, our query is 11 and our index contains

A. 10 agreed = 1
B. 110011 agreed = 3
C. 110100 agreed = 5

A binary search will find A-B. The n-candidates x 2 window includes A and
B. C is closer and has a longer prefix overlap with the query than A does.
My understanding of LSHForest is that its ascent by prefix length would
necessarily find C. Your approach would not.

While that may be a feature of your approach, I think we have reason to
prefer a published algorithm.


Re: [Scikit-learn-general] Performance of LSHForest

2015-04-16 Thread Daniel Vainsencher
Hi Joel,

To extend your analysis:
- when n_samples*n_indices is large enough, the bottleneck is the use of 
the index, as you say.
- when n_dimensions*n_candidates is large enough, the bottleneck is 
computation of true distances between DB points and the query.

Serving both kinds of use case well is perfectly possible, but 
requires an index implementation that is both:
A) fast, and
B) optimal in using the index to reduce the number of candidates for 
which we compute full distances.

Here is a variant of your proposal ("keep better track of context") 
that still requires a little Cython, but improves both aspects A and B 
and reduces code complexity.

Observation I:
Only a single binary search per index is necessary, the first. After we 
find the correct location for the query binary code, we can restrict 
ourselves to the n_candidates (or even fewer) before and after that 
location.

So no further binary searches are necessary at all, and the restriction 
to a small linear part of the array should be much more cache friendly. 
This makes full use of our array implementation of an ordered collection, 
instead of acting as if we were still on a binary tree implementation as 
in the original LSH-Forest paper.

There is a price to pay for this simplification: we are now looking at 
(computing full distance from query for) 2*n_candidates*n_indices 
points, which can be expensive (we improved A at a cost to B).

But here is where some Cython can be really useful. Observation II:
The best information we can extract from the binary representation is 
not the distances in the tree structure, but hamming distances to the query.

So after the restriction of I, compute the *hamming distances* of the 
2*n_candidates*n_indices points each from the binary representation of 
the query (corresponding to the appropriate index). Then compute full 
metric only for the n_candidates with the lowest hamming distances.

This should achieve a pretty good sweet spot of performance, with just a 
bit of Cython.
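A rough pure-NumPy sketch of Observations I and II for a single index
(a hypothetical helper, not the current LSHForest code; names and the
uint32 hash width are my assumptions):

```python
import numpy as np

def query_one_index(sorted_hashes, query_hash, n_candidates):
    """Sketch: one binary search, then a linear window of n_candidates
    entries on either side, ranked by hamming distance to the query.

    sorted_hashes: 1-D uint32 array of binary codes, pre-sorted.
    query_hash:    uint32 binary code of the query point.
    """
    # Observation I: a single searchsorted per index suffices.
    pos = np.searchsorted(sorted_hashes, query_hash)
    lo = max(pos - n_candidates, 0)
    hi = min(pos + n_candidates, len(sorted_hashes))
    window = sorted_hashes[lo:hi]

    # Observation II: rank the window by hamming distance, not prefix.
    xor = np.bitwise_xor(window, query_hash)
    bits = np.unpackbits(xor.view(np.uint8)).reshape(len(window), 32)
    hamming = bits.sum(axis=1)
    # Positions (in the sorted index) of the best candidates.
    return lo + np.argsort(hamming)[:n_candidates]
```

Pooling candidates across indices and the final full-distance
computation are omitted; a Cython version would replace the
unpackbits-based popcount with POPCNT.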

Daniel

On 04/16/2015 12:18 AM, Joel Nothman wrote:
 Once we're dealing with large enough index and n_candidates, most time
 is spent in searchsorted in the synchronous ascending phase, while any
 overhead around it is marginal. Currently we are searching over the
 whole array in each searchsorted, while it could be rewritten to keep
 better track of context to cut down the overall array when searching.
 While possible, I suspect this will look confusing in Python/Numpy, and
 Cython will be a clearer and faster way to present this logic.

 On the other hand, time spent in _compute_distances is substantial, and
 yet most of its consumption is /outside/ of pairwise_distances. This
 commit
 https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a
 cuts a basic benchmark from 85 to 70 seconds. Vote here for merge
 https://github.com/scikit-learn/scikit-learn/pull/4603!

 On 16 April 2015 at 12:32, Maheshakya Wijewardena
 pmaheshak...@gmail.com wrote:

 Moreover, this drawback occurs because LSHForest does not vectorize
 multiple queries as in 'ball_tree' or any other method. This slows
 the exact neighbor distance calculation down significantly after
 approximation. This will not be a problem if queries are for
 individual points. Unfortunately, the former is the more useful usage of
 LSHForest.
 Are you trying individual queries or multiple queries (for n_samples)?

 On Thu, Apr 16, 2015 at 6:14 AM, Daniel Vainsencher
 daniel.vainsenc...@gmail.com
 wrote:

 LSHForest is not intended for dimensions at which exact methods
 work well, nor for tiny datasets. Try d > 500, n_points > 10; I
 don't remember the switchover point.

 The documentation should make this clear, but unfortunately I
 don't see that it does.

 On Apr 15, 2015 7:08 PM, Joel Nothman joel.noth...@gmail.com wrote:

 I agree this is disappointing, and we need to work on making
 LSHForest faster. Portions should probably be coded in
 Cython, for instance, as the current implementation is a bit
 circuitous in order to work in numpy. PRs are welcome.

 LSHForest could use parallelism to be faster, but so can
 (and will) the exact neighbors methods. In theory in
 LSHForest, each tree could be stored on entirely different
 machines, providing memory benefits, but scikit-learn can't
 really take advantage of this.

 Having said that, I would also try with higher n_features
 and n_queries. We have to limit the scale of our examples in
 order to limit the overall document compilation time.

 On 16 April 2015 at 01:12, Miroslav Batchkarov
 mbatchka...@gmail.com wrote:

   

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-16 Thread Joel Nothman
I more or less agree. Certainly we only need to do one searchsorted per
query per tree, and then do linear scans. There is a question of how close
we stay to the original LSHForest algorithm, which relies on matching
prefixes rather than hamming distance. Hamming distance is easier to
calculate in NumPy and is probably faster to calculate in C too (with or
without using POPCNT). Perhaps the only advantage of using Cython in your
solution is to avoid the memory overhead of unpackbits.
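For concreteness, the two pure-NumPy routes to hamming distance
mentioned here might look like this (a sketch with made-up codes, not
the actual LSHForest internals; the lookup table stands in for what
POPCNT would do in C):

```python
import numpy as np

# Hamming distances from a query code to an array of codes, two ways.
# Toy 32-bit codes; in LSHForest these would be the stored hashes.
codes = np.array([0b110100, 0b110011, 0b100000], dtype=np.uint32)
query = np.uint32(0b110101)
xor = np.bitwise_xor(codes, query)

# Route 1: unpackbits -- simple, but expands every byte to 8 bytes.
dist_unpack = np.unpackbits(xor.view(np.uint8)).reshape(len(codes), 32).sum(axis=1)

# Route 2: an 8-bit popcount lookup table -- same result without the
# 8x memory expansion.
popcount8 = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)
dist_table = popcount8[xor.view(np.uint8)].reshape(len(codes), 4).sum(axis=1)

assert (dist_unpack == dist_table).all()  # both give [1, 2, 3]
```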

However, n_candidates before and after is arguably not sufficient if one
side has more than n_candidates with a high prefix overlap. But until we
look at the suffixes we can't know if it is closer or farther in hamming
distance.

I also think the use of n_candidates in the current code is somewhat
broken, as suggested by my XXX comment in _get_candidates, which we
discussed but did not resolve clearly. I think it will be hard to make
improvements of this sort without breaking the current results and
parameter sensitivities of the implementation.

On 17 April 2015 at 00:16, Daniel Vainsencher daniel.vainsenc...@gmail.com
wrote:

 Hi Joel,

 To extend your analysis:
 - when n_samples*n_indices is large enough, the bottleneck is the use of
 the index, as you say.
 - when n_dimensions*n_candidates is large enough, the bottleneck is
 computation of true distances between DB points and the query.

 Serving both kinds of use case well is perfectly possible, but
 requires an index implementation that is both:
 A) fast, and
 B) optimal in using the index to reduce the number of candidates for
 which we compute full distances.

 Here is a variant of your proposal (better keep track of context) that
 also requires a little Cython but improves both aspects A and B and
 reduces code complexity.

 Observation I:
 Only a single binary search per index is necessary, the first. After we
 find the correct location for the query binary code, we can restrict
 ourselves to the n_candidates (or even fewer) before and after that
 location.

 So no further binary searches are necessary at all, and the restriction
 to a small linear part of the array should be much more cache friendly.
 This makes full use of our array implementation of an ordered collection,
 instead of acting as if we were still on a binary tree implementation as
 in the original LSH-Forest paper.

 There is a price to pay for this simplification: we are now looking at
 (computing full distance from query for) 2*n_candidates*n_indices
 points, which can be expensive (we improved A at a cost to B).

 But here is where some Cython can be really useful. Observation II:
 The best information we can extract from the binary representation is
 not the distances in the tree structure, but hamming distances to the
 query.

 So after the restriction of I, compute the *hamming distances* of the
 2*n_candidates*n_indices points each from the binary representation of
 the query (corresponding to the appropriate index). Then compute full
 metric only for the n_candidates with the lowest hamming distances.

 This should achieve a pretty good sweet spot of performance, with just a
 bit of Cython.

 Daniel

 On 04/16/2015 12:18 AM, Joel Nothman wrote:
  Once we're dealing with large enough index and n_candidates, most time
  is spent in searchsorted in the synchronous ascending phase, while any
  overhead around it is marginal. Currently we are searching over the
  whole array in each searchsorted, while it could be rewritten to keep
  better track of context to cut down the overall array when searching.
  While possible, I suspect this will look confusing in Python/Numpy, and
  Cython will be a clearer and faster way to present this logic.
 
  On the other hand, time spent in _compute_distances is substantial, and
  yet most of its consumption is /outside/ of pairwise_distances. This
  commit
  
 https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a
 
  cuts a basic benchmark from 85 to 70 seconds. Vote here for merge
  https://github.com/scikit-learn/scikit-learn/pull/4603!
 
  On 16 April 2015 at 12:32, Maheshakya Wijewardena
 pmaheshak...@gmail.com wrote:
 
  Moreover, this drawback occurs because LSHForest does not vectorize
  multiple queries as in 'ball_tree' or any other method. This slows
  the exact neighbor distance calculation down significantly after
  approximation. This will not be a problem if queries are for
 individual points. Unfortunately, the former is the more useful usage of
  LSHForest.
  Are you trying individual queries or multiple queries (for
 n_samples)?
 
  On Thu, Apr 16, 2015 at 6:14 AM, Daniel Vainsencher
 daniel.vainsenc...@gmail.com
  wrote:
 
 LSHForest is not intended for dimensions at which exact methods
 work well, nor for tiny datasets. Try d > 500, n_points > 10; I
 don't remember the switchover point.
 
 

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-16 Thread Daniel Vainsencher
On 04/16/2015 05:49 PM, Joel Nothman wrote:
 I more or less agree. Certainly we only need to do one searchsorted per
 query per tree, and then do linear scans. There is a question of how
 close we stay to the original LSHForest algorithm, which relies on
 matching prefixes rather than hamming distance. Hamming distance is
 easier to calculate in NumPy and is probably faster to calculate in C
 too (with or without using POPCNT). Perhaps the only advantage of using
 Cython in your solution is to avoid the memory overhead of unpackbits.
You obviously know more than I do about Cython vs numpy options.

 However, n_candidates before and after is arguably not sufficient if one
 side has more than n_candidates with a high prefix overlap.
I disagree. Being able to look at 2*n_candidates that must contain 
n_candidates of the closest ones, rather than as many as happen to 
agree on x number of bits is a feature, not a bug. Especially if we 
then filter them by hamming distance.


 But until we
 look at the suffixes we can't know if it is closer or farther in hamming
 distance.

 I also think the use of n_candidates in the current code is somewhat
 broken, as suggested by my XXX comment in _get_candidates, which we
 discussed but did not resolve clearly.
I think n_candidates embodies a reasonable desire to be able to set the 
investment per query, but there are many ways to do that. Since the 
research does not provide very good advice (I think the guarantees 
require looking at sqrt(db size) candidates for every query), it leaves 
the field wide open for different schemes. Should the number of 
distances calculated be per number of indices, or fixed? etc.

I am proposing a particular definition:
- Read 2*n_candidates*n_indices hashes
- Calculate n_candidates full distances.
The motivation for the first is that it allows us to get all 
n_candidates from the same side of the same index, if that is where the 
good stuff (hamming distance-wise) is, while keeping the hamming 
distance computation not too expensive. But if someone argues we should 
calculate 10x as many hamming distances to decide which full distances 
to compute, I don't have a very good argument one way or the other.
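A sketch of that budget in NumPy (hypothetical helper and names,
assuming the 2*n_candidates window members from each index have
already been pooled together with their hamming distances):

```python
import numpy as np

def select_by_hamming(candidate_ids, hamming, n_candidates):
    """Keep only the n_candidates best pooled candidates for the
    expensive full-metric distance computation."""
    ids = np.asarray(candidate_ids)
    ham = np.asarray(hamming)
    # Merge duplicates across indices, keeping each point's smallest
    # hamming distance (sort by id, then by distance within each id).
    order = np.lexsort((ham, ids))
    ids, ham = ids[order], ham[order]
    keep_first = np.ones(len(ids), dtype=bool)
    keep_first[1:] = ids[1:] != ids[:-1]
    ids, ham = ids[keep_first], ham[keep_first]
    # Cheap partial selection of the n_candidates smallest distances.
    if len(ids) > n_candidates:
        best = np.argpartition(ham, n_candidates)[:n_candidates]
        ids = ids[best]
    return ids
```

The full metric would then be computed only for the returned ids;
whether the hamming budget should instead scale with n_indices is
exactly the open question raised here.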

 I think it will be hard to make
 improvements of this sort without breaking the current results and
 parameter sensitivities of the implementation.
You seem to be assuming it is tuned; I am not even sure there exists a 
precise sense in which it is tunable, except for a particular dataset 
(and that is not very useful) :)

Daniel



 On 17 April 2015 at 00:16, Daniel Vainsencher
 daniel.vainsenc...@gmail.com wrote:

 Hi Joel,

 To extend your analysis:
 - when n_samples*n_indices is large enough, the bottleneck is the use of
 the index, as you say.
 - when n_dimensions*n_candidates is large enough, the bottleneck is
 computation of true distances between DB points and the query.

 Serving both kinds of use case well is perfectly possible, but
 requires an index implementation that is both:
 A) fast, and
 B) optimal in using the index to reduce the number of candidates for
 which we compute full distances.

 Here is a variant of your proposal (better keep track of context) that
 also requires a little Cython but improves both aspects A and B and
 reduces code complexity.

 Observation I:
 Only a single binary search per index is necessary, the first. After we
 find the correct location for the query binary code, we can restrict
 ourselves to the n_candidates (or even fewer) before and after that
 location.

 So no further binary searches are necessary at all, and the restriction
 to a small linear part of the array should be much more cache friendly.
 This makes full use of our array implementation of an ordered collection,
 instead of acting as if we were still on a binary tree implementation as
 in the original LSH-Forest paper.

 There is a price to pay for this simplification: we are now looking at
 (computing full distance from query for) 2*n_candidates*n_indices
 points, which can be expensive (we improved A at a cost to B).

 But here is where some Cython can be really useful. Observation II:
 The best information we can extract from the binary representation is
 not the distances in the tree structure, but hamming distances to
 the query.

 So after the restriction of I, compute the *hamming distances* of the
 2*n_candidates*n_indices points each from the binary representation of
 the query (corresponding to the appropriate index). Then compute full
 metric only for the n_candidates with the lowest hamming distances.

 This should achieve a pretty good sweet spot of performance, with just a
 bit of Cython.

 Daniel

 On 04/16/2015 12:18 AM, Joel Nothman wrote:
  Once we're dealing with large enough index and n_candidates, most time
  is spent in searchsorted in the 

[Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Miroslav Batchkarov
Hi everyone,

I was really impressed by the speedups provided by LSHForest compared to 
brute-force search. Out of curiosity, I compared LSHForest to the existing ball 
tree implementation. The approximate algorithm is consistently slower (see 
below). Is this normal, and should it be mentioned in the documentation? Does 
approximate search offer any benefits in terms of memory usage?


I ran the same example 
http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py
with algorithm='ball_tree'. I also had to set metric='euclidean' (this may 
affect results). The output is:

Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup: 0.0, accuracy: 1.00 +/-0.00
Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup: 0.1, accuracy: 0.94 +/-0.05
Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup: 0.2, accuracy: 0.92 +/-0.07
Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup: 0.3, accuracy: 0.92 +/-0.07
Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup: 0.5, accuracy: 0.84 +/-0.10
Index size: 100000, exact: 0.008s, LSHF: 0.016s, speedup: 0.5, accuracy: 0.80 +/-0.06

With n_candidates=100, the output is

Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup: 0.0, accuracy: 1.00 +/-0.00
Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup: 0.1, accuracy: 0.94 +/-0.05
Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup: 0.2, accuracy: 0.92 +/-0.07
Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup: 0.4, accuracy: 0.90 +/-0.11
Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup: 0.7, accuracy: 0.82 +/-0.13
Index size: 100000, exact: 0.007s, LSHF: 0.013s, speedup: 0.6, accuracy: 0.78 +/-0.04



---
Miroslav Batchkarov
PhD Student,
Text Analysis Group,
Department of Informatics,
University of Sussex





Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
I agree this is disappointing, and we need to work on making LSHForest
faster. Portions should probably be coded in Cython, for instance, as the
current implementation is a bit circuitous in order to work in numpy. PRs
are welcome.

LSHForest could use parallelism to be faster, but so can (and will) the
exact neighbors methods. In theory in LSHForest, each tree could be
stored on entirely different machines, providing memory benefits, but
scikit-learn can't really take advantage of this.

Having said that, I would also try with higher n_features and n_queries. We
have to limit the scale of our examples in order to limit the overall
document compilation time.

On 16 April 2015 at 01:12, Miroslav Batchkarov mbatchka...@gmail.com
wrote:

 Hi everyone,

 was really impressed by the speedups provided by LSHForest compared to
 brute-force search. Out of curiosity, I compared LSHForest to the existing
 ball tree implementation. The approximate algorithm is consistently slower
 (see below). Is this normal and should it be mentioned in the
 documentation? Does approximate search offer any benefits in terms of
 memory usage?


 I ran the same example
 http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py
  with
 algorithm='ball_tree'. I also had to set metric='euclidean' (this may
 affect results). The output is:

 Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup: 0.0, accuracy:
 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup: 0.1, accuracy:
 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup: 0.2, accuracy:
 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup: 0.3, accuracy:
 0.92 +/-0.07
 Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup: 0.5, accuracy:
 0.84 +/-0.10
 Index size: 100000, exact: 0.008s, LSHF: 0.016s, speedup: 0.5, accuracy:
 0.80 +/-0.06

 With n_candidates=100, the output is

 Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup: 0.0, accuracy:
 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup: 0.1, accuracy:
 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup: 0.2, accuracy:
 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup: 0.4, accuracy:
 0.90 +/-0.11
 Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup: 0.7, accuracy:
 0.82 +/-0.13
 Index size: 100000, exact: 0.007s, LSHF: 0.013s, speedup: 0.6, accuracy:
 0.78 +/-0.04



 ---
 Miroslav Batchkarov
 PhD Student,
 Text Analysis Group,
 Department of Informatics,
 University of Sussex









Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
Oh. Silly mistake. Doesn't break with the correct patch, now at PR#4604...

On 16 April 2015 at 14:24, Joel Nothman joel.noth...@gmail.com wrote:

 Except apparently that commit breaks the code... Maybe I've misunderstood
 something :(

 On 16 April 2015 at 14:18, Joel Nothman joel.noth...@gmail.com wrote:

 ball tree is not vectorized in the sense of SIMD, but there is
 Python/numpy overhead in LSHForest that is not present in ball tree.

 I think one of the problems is the high n_candidates relative to the
 n_neighbors. This really increases the search time.

 Once we're dealing with large enough index and n_candidates, most time is
 spent in searchsorted in the synchronous ascending phase, while any
 overhead around it is marginal. Currently we are searching over the whole
 array in each searchsorted, while it could be rewritten to keep better
 track of context to cut down the overall array when searching. While
 possible, I suspect this will look confusing in Python/Numpy, and Cython
 will be a clearer and faster way to present this logic.

 On the other hand, time spent in _compute_distances is substantial, and
 yet most of its consumption is *outside* of pairwise_distances. This
 commit
 https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a
 cuts a basic benchmark from 85 to 70 seconds. Vote here for merge
 https://github.com/scikit-learn/scikit-learn/pull/4603!

 On 16 April 2015 at 12:32, Maheshakya Wijewardena pmaheshak...@gmail.com
  wrote:

 Moreover, this drawback occurs because LSHForest does not vectorize
 multiple queries as in 'ball_tree' or any other method. This slows the
 exact neighbor distance calculation down significantly after approximation.
 This will not be a problem if queries are for individual points.
 Unfortunately, the former is the more useful usage of LSHForest.
 Are you trying individual queries or multiple queries (for n_samples)?

 On Thu, Apr 16, 2015 at 6:14 AM, Daniel Vainsencher 
 daniel.vainsenc...@gmail.com wrote:

 LSHForest is not intended for dimensions at which exact methods work
 well, nor for tiny datasets. Try d > 500, n_points > 10; I don't remember
 the switchover point.

 The documentation should make this clear, but unfortunately I don't see
 that it does.
 On Apr 15, 2015 7:08 PM, Joel Nothman joel.noth...@gmail.com wrote:

 I agree this is disappointing, and we need to work on making LSHForest
 faster. Portions should probably be coded in Cython, for instance, as the
 current implementation is a bit circuitous in order to work in numpy. PRs
 are welcome.

 LSHForest could use parallelism to be faster, but so can (and will)
 the exact neighbors methods. In theory in LSHForest, each tree could be
 stored on entirely different machines, providing memory benefits, but
 scikit-learn can't really take advantage of this.

 Having said that, I would also try with higher n_features and
 n_queries. We have to limit the scale of our examples in order to limit 
 the
 overall document compilation time.

 On 16 April 2015 at 01:12, Miroslav Batchkarov mbatchka...@gmail.com
 wrote:

 Hi everyone,

 was really impressed by the speedups provided by LSHForest compared
 to brute-force search. Out of curiosity, I compared LSHForest to the
 existing ball tree implementation. The approximate algorithm is
 consistently slower (see below). Is this normal and should it be 
 mentioned
 in the documentation? Does approximate search offer any benefits in terms
 of memory usage?


 I ran the same example
 http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py
  with
 algorithm='ball_tree'. I also had to set metric='euclidean' (this may
 affect results). The output is:

 Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup: 0.0,
 accuracy: 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup: 0.1,
 accuracy: 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup: 0.2,
 accuracy: 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup: 0.3,
 accuracy: 0.92 +/-0.07
 Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup: 0.5,
 accuracy: 0.84 +/-0.10
 Index size: 100000, exact: 0.008s, LSHF: 0.016s, speedup: 0.5,
 accuracy: 0.80 +/-0.06

 With n_candidates=100, the output is

 Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup: 0.0,
 accuracy: 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup: 0.1,
 accuracy: 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup: 0.2,
 accuracy: 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup: 0.4,
 accuracy: 0.90 +/-0.11
 Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup: 0.7,
 accuracy: 0.82 +/-0.13
 Index size: 100000, exact: 0.007s, LSHF: 0.013s, speedup: 0.6,
 accuracy: 0.78 +/-0.04



 ---
 Miroslav Batchkarov
 PhD Student,
 Text Analysis Group,
 Department of Informatics,
 

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Daniel Vainsencher
LSHForest is not intended for dimensions at which exact methods work well,
nor for tiny datasets. Try d > 500, n_points > 10; I don't remember the
switchover point.

The documentation should make this clear, but unfortunately I don't see
that it does.
On Apr 15, 2015 7:08 PM, Joel Nothman joel.noth...@gmail.com wrote:

 I agree this is disappointing, and we need to work on making LSHForest
 faster. Portions should probably be coded in Cython, for instance, as the
 current implementation is a bit circuitous in order to work in numpy. PRs
 are welcome.

 LSHForest could use parallelism to be faster, but so can (and will) the
 exact neighbors methods. In theory in LSHForest, each tree could be
 stored on entirely different machines, providing memory benefits, but
 scikit-learn can't really take advantage of this.

 Having said that, I would also try with higher n_features and n_queries.
 We have to limit the scale of our examples in order to limit the overall
 document compilation time.

 On 16 April 2015 at 01:12, Miroslav Batchkarov mbatchka...@gmail.com
 wrote:

 Hi everyone,

 I was really impressed by the speedups provided by LSHForest compared to
 brute-force search. Out of curiosity, I compared LSHForest to the existing
 ball tree implementation. The approximate algorithm is consistently slower
 (see below). Is this normal, and should it be mentioned in the
 documentation? Does approximate search offer any benefits in terms of
 memory usage?


 I ran the same example
 http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py
 with algorithm='ball_tree'. I also had to set metric='euclidean' (this may
 affect results). The output is:

 Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup: 0.0, accuracy: 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup: 0.1, accuracy: 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup: 0.2, accuracy: 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup: 0.3, accuracy: 0.92 +/-0.07
 Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup: 0.5, accuracy: 0.84 +/-0.10
 Index size: 100000, exact: 0.008s, LSHF: 0.016s, speedup: 0.5, accuracy: 0.80 +/-0.06

 With n_candidates=100, the output is

 Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup: 0.0, accuracy: 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup: 0.1, accuracy: 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup: 0.2, accuracy: 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup: 0.4, accuracy: 0.90 +/-0.11
 Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup: 0.7, accuracy: 0.82 +/-0.13
 Index size: 100000, exact: 0.007s, LSHF: 0.013s, speedup: 0.6, accuracy: 0.78 +/-0.04



 ---
 Miroslav Batchkarov
 PhD Student,
 Text Analysis Group,
 Department of Informatics,
 University of Sussex





 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
Ball tree is not vectorized in the sense of SIMD, but there is Python/numpy
overhead in LSHForest that is not present in ball tree.

I think one of the problems is the high n_candidates relative to
n_neighbors, which really increases the search time.

Once we're dealing with a large enough index and n_candidates, most time is
spent in searchsorted during the synchronous ascending phase, while any
overhead around it is marginal. Currently we search over the whole array in
each searchsorted call, whereas the code could be rewritten to keep track of
context and narrow the range being searched. While possible, I suspect this
would look confusing in Python/Numpy, and Cython would be a clearer and
faster way to express this logic.
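To sketch the "keep track of context" idea: the entries sharing the top k bits of the query's hash form a contiguous slice of the sorted hash array, and once one slice is known, the slice for a longer prefix can be binary-searched within it instead of over the whole array. Illustrative stdlib code, not scikit-learn's actual implementation (names are made up):

```python
import bisect

def prefix_range(hashes, query_hash, k, n_bits, lo=0, hi=None):
    """Slice of the sorted integer array `hashes` whose top k bits match
    query_hash. Passing lo/hi from a shorter prefix's slice restricts
    the binary search to that window instead of the whole array."""
    if hi is None:
        hi = len(hashes)
    shift = n_bits - k
    prefix = query_hash >> shift
    # The matching block is [prefix << shift, (prefix + 1) << shift).
    left = bisect.bisect_left(hashes, prefix << shift, lo, hi)
    right = bisect.bisect_left(hashes, (prefix + 1) << shift, lo, hi)
    return left, right
```

In the ascending (prefix-shortening) phase the containment runs the other way, so the previous endpoints bound the new searches from the inside rather than the outside, but the same two-binary-search shape applies.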

On the other hand, time spent in _compute_distances is substantial, and yet
most of its consumption is *outside* of pairwise_distances. This commit
https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a
cuts a basic benchmark from 85 to 70 seconds. Vote here for merge
https://github.com/scikit-learn/scikit-learn/pull/4603!
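The kind of overhead being trimmed can be sketched as follows: pool the candidates from all trees, deduplicate once, and compute their exact distances in a single vectorized pass, so nearly all time is spent inside the distance kernel. This is an illustrative sketch with hypothetical names, not the actual _compute_distances code:

```python
import numpy as np

def candidate_distances(X, query, candidate_idx):
    # Deduplicate candidates pooled from all trees, then compute
    # euclidean distances to the query in one vectorized operation.
    unique_idx = np.unique(candidate_idx)
    diff = X[unique_idx] - query
    return unique_idx, np.sqrt(np.einsum('ij,ij->i', diff, diff))
```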


Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Maheshakya Wijewardena
Moreover, this drawback occurs because LSHForest does not vectorize
multiple queries as 'ball_tree' and the other exact methods do. This
significantly slows down the exact neighbor-distance calculation performed
after the approximation step. It is not a problem when queries are made for
individual points, but unfortunately batch queries are the more useful
usage of LSHForest. Are you running individual queries or multiple queries
(over n_samples)?
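The point about vectorizing multiple queries can be illustrated with a toy exact search: a batch of queries answered via one broadcasted operation amortizes the Python-level overhead that a per-query loop pays repeatedly. A hedged sketch, not LSHForest's or ball tree's code:

```python
import numpy as np

def knn_batch(X, Q, k):
    # One (n_queries, n_index) squared-distance matrix via broadcasting,
    # instead of one Python-level call per query.
    d2 = ((Q[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, :k]
```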


Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
Except apparently that commit breaks the code... Maybe I've misunderstood
something :(
