Re: [Scikit-learn-general] Performance of LSHForest

2015-04-20 Thread Daniel Vainsencher
On 04/19/2015 08:18 AM, Joel Nothman wrote: On 17 April 2015 at 13:52, Daniel Vainsencher daniel.vainsenc...@gmail.com mailto:daniel.vainsenc...@gmail.com wrote: On 04/16/2015 05:49 PM, Joel Nothman wrote: I more or less agree. Certainly we only need to do one searchsorted per

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-19 Thread Joel Nothman
On 17 April 2015 at 13:52, Daniel Vainsencher daniel.vainsenc...@gmail.com wrote: On 04/16/2015 05:49 PM, Joel Nothman wrote: I more or less agree. Certainly we only need to do one searchsorted per query per tree, and then do linear scans. There is a question of how close we stay to the

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-16 Thread Daniel Vainsencher
Hi Joel, To extend your analysis: - when n_samples*n_indices is large enough, the bottleneck is the use of the index, as you say. - when n_dimensions*n_candidates is large enough, the bottleneck is computation of true distances between DB points and the query. To serve well both kinds of use

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-16 Thread Joel Nothman
I more or less agree. Certainly we only need to do one searchsorted per query per tree, and then do linear scans. There is a question of how close we stay to the original LSHForest algorithm, which relies on matching prefixes rather than hamming distance. Hamming distance is easier to calculate in

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-16 Thread Daniel Vainsencher
On 04/16/2015 05:49 PM, Joel Nothman wrote: I more or less agree. Certainly we only need to do one searchsorted per query per tree, and then do linear scans. There is a question of how close we stay to the original LSHForest algorithm, which relies on matching prefixes rather than hamming

[Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Miroslav Batchkarov
Hi everyone, was really impressed by the speedups provided by LSHForest compared to brute-force search. Out of curiosity, I compared LSRForest to the existing ball tree implementation. The approximate algorithm is consistently slower (see below). Is this normal and should it be mentioned in

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
I agree this is disappointing, and we need to work on making LSHForest faster. Portions should probably be coded in Cython, for instance, as the current implementation is a bit circuitous in order to work in numpy. PRs are welcome. LSHForest could use parallelism to be faster, but so can (and

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
Oh. Silly mistake. Doesn't break with the correct patch, now at PR#4604... On 16 April 2015 at 14:24, Joel Nothman joel.noth...@gmail.com wrote: Except apparently that commit breaks the code... Maybe I've misunderstood something :( On 16 April 2015 at 14:18, Joel Nothman

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Daniel Vainsencher
LHSForest is not intended for dimensions at which exact methods work well, nor for tiny datasets. Try d500, n_points10, I don't remember the switchover point. The documentation should make this clear, but unfortunately I don't see that it does. On Apr 15, 2015 7:08 PM, Joel Nothman

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
ball tree is not vectorized in the sense of SIMD, but there is Python/numpy overhead in LSHForest that is not present in ball tree. I think one of the problems is the high n_candidates relative to the n_neighbors. This really increases the search time. Once we're dealing with large enough index

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Maheshakya Wijewardena
Moreover, this drawback occurs because LSHForest does not vectorize multiple queries as in 'ball_tree' or any other method. This slows the exact neighbor distance calculation down significantly after approximation. This will not be a problem if queries are for individual points. Unfortunately,

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
Except apparently that commit breaks the code... Maybe I've misunderstood something :( On 16 April 2015 at 14:18, Joel Nothman joel.noth...@gmail.com wrote: ball tree is not vectorized in the sense of SIMD, but there is Python/numpy overhead in LSHForest that is not present in ball tree. I