On 04/16/2015 05:49 PM, Joel Nothman wrote:
I more or less agree. Certainly we only need to do one searchsorted per
query per tree, and then do linear scans. There is a question of how close
we stay to the original LSHForest algorithm, which relies on matching
prefixes rather than hamming distance. Hamming distance is easier to
calculate in ...
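
To make the searchsorted-per-tree idea concrete, here is a rough numpy
sketch of the kind of per-tree query being discussed. The function name,
the packed-uint64 hash representation and the candidate window are all
invented for illustration; this is not the actual LSHForest code.

    import numpy as np

    def query_one_tree(sorted_hashes, query_hash, n_candidates):
        """Toy per-tree query: one searchsorted, then a linear scan.

        sorted_hashes: sorted 1-D uint64 array of packed hash keys for one tree.
        query_hash:    the query's packed uint64 key.
        (Names and packing are invented for illustration.)
        """
        query_hash = np.uint64(query_hash)

        # One binary search per query per tree.
        pos = np.searchsorted(sorted_hashes, query_hash)

        # Linear scan around the insertion point to gather candidates.
        lo = max(pos - n_candidates, 0)
        hi = min(pos + n_candidates, len(sorted_hashes))
        candidates = sorted_hashes[lo:hi]

        # Hamming distance is easy to vectorize: XOR, then popcount per row.
        xor = candidates ^ query_hash
        hamming = np.unpackbits(xor.view(np.uint8)).reshape(len(xor), 64).sum(axis=1)

        # The original LSH Forest instead ranks by length of the matching
        # *prefix*, i.e. the number of leading bits shared with the query.
        prefix = np.zeros(len(candidates), dtype=np.intp)
        for shift in range(63, -1, -1):  # toy loop, not optimised
            prefix[(xor >> np.uint64(shift)) == 0] = 64 - shift
        return candidates, hamming, prefix

Either ranking criterion reuses the same searchsorted-plus-scan skeleton;
the difference is only in how the candidates are ordered afterwards.
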
Hi Joel,
To extend your analysis:
- when n_samples*n_indices is large enough, the bottleneck is the use of
the index, as you say.
- when n_dimensions*n_candidates is large enough, the bottleneck is the
computation of true distances between DB points and the query.
To serve well both kinds of use cases ...
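
To make the two regimes explicit, here is a back-of-the-envelope cost
model. The function name and the constants c_index/c_dist are made up and
would have to be measured; this only shows which term dominates where.

    import numpy as np

    def per_query_cost(n_samples, n_indices, n_dimensions, n_candidates,
                       c_index=1.0, c_dist=1.0):
        # Phase 1: use of the index -- one binary search plus a linear scan
        # in each of the n_indices sorted hash arrays. (In practice the scan
        # term also grows with n_samples through hash collisions.)
        index_cost = c_index * n_indices * (np.log2(n_samples) + n_candidates)
        # Phase 2: exact distances between the query and the candidate set.
        distance_cost = c_dist * n_candidates * n_dimensions
        return index_cost, distance_cost

    # Index-bound regime: many samples and trees, cheap distances.
    print(per_query_cost(n_samples=10**7, n_indices=50,
                         n_dimensions=30, n_candidates=100))
    # Distance-bound regime: high-dimensional points, many candidates.
    print(per_query_cost(n_samples=10**5, n_indices=10,
                         n_dimensions=5000, n_candidates=1000))
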
Oh. Silly mistake. Doesn't break with the correct patch, now at PR#4604...
On 16 April 2015 at 14:24, Joel Nothman wrote:
Except apparently that commit breaks the code... Maybe I've misunderstood
something :(
On 16 April 2015 at 14:18, Joel Nothman wrote:
ball tree is not vectorized in the sense of SIMD, but there is Python/numpy
overhead in LSHForest that is not present in ball tree.
I think one of the problems is the high n_candidates relative to the
n_neighbors. This really increases the search time.
Once we're dealing with large enough index a ...
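
For anyone who wants to see how much n_candidates alone matters on their
own machine, a quick-and-dirty comparison along these lines should do it.
This assumes the 0.16-era API where LSHForest lives in sklearn.neighbors;
the data sizes are arbitrary.

    import time
    import numpy as np
    from sklearn.neighbors import LSHForest, NearestNeighbors

    rng = np.random.RandomState(42)
    X = rng.randn(100000, 128)      # arbitrary sizes, just for timing
    queries = rng.randn(100, 128)

    for n_candidates in (10, 50, 200):
        lshf = LSHForest(n_estimators=10, n_candidates=n_candidates,
                         random_state=42).fit(X)
        t0 = time.time()
        lshf.kneighbors(queries, n_neighbors=5)
        print("LSHForest, n_candidates=%d: %.2fs"
              % (n_candidates, time.time() - t0))

    ball = NearestNeighbors(algorithm="ball_tree").fit(X)
    t0 = time.time()
    ball.kneighbors(queries, n_neighbors=5)
    print("ball_tree: %.2fs" % (time.time() - t0))
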
Moreover, this drawback occurs because LSHForest does not vectorize across
multiple queries the way 'ball_tree' and the other methods do. This
significantly slows down the exact neighbor distance calculation that
follows the approximation step. It will not be a problem if queries are for
individual points. Unfortunately, form ...
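
A sketch of the difference, with invented shapes and candidate sets rather
than the actual LSHForest internals: refining one query at a time pays the
Python/numpy call overhead once per query, while a batched computation pays
it once overall.

    import numpy as np
    from sklearn.metrics import pairwise_distances

    rng = np.random.RandomState(0)
    X = rng.randn(50000, 256)                       # database
    Q = rng.randn(200, 256)                         # queries
    cand = rng.randint(0, len(X), size=(200, 100))  # 100 candidate ids per query

    # Per-query refinement: 200 tiny pairwise_distances calls, so the
    # Python/numpy overhead is paid once per query.
    dists_loop = [pairwise_distances(Q[i:i + 1], X[cand[i]])[0]
                  for i in range(len(Q))]

    # Batched refinement: because every candidate set here has the same
    # shape, one broadcasted computation amortises that overhead.
    diffs = X[cand] - Q[:, np.newaxis, :]            # (200, 100, 256)
    dists_batch = np.sqrt((diffs ** 2).sum(axis=-1))  # (200, 100)
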
LSHForest is not intended for dimensions at which exact methods work well,
nor for tiny datasets. Try d>500, n_points>10, I don't remember the
switchover point.
The documentation should make this clear, but unfortunately I don't see
that it does.
On Apr 15, 2015 7:08 PM, "Joel Nothman" wrote:
I agree this is disappointing, and we need to work on making LSHForest
faster. Portions should probably be coded in Cython, for instance, as the
current implementation is a bit circuitous in order to work in numpy. PRs
are welcome.
LSHForest could use parallelism to be faster, but so can (and will ...
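
As a sketch only, not a plan for the implementation: coarse-grained
parallelism is already possible from the outside by chunking the queries
and fanning them out with joblib. The chunk count, n_jobs and data sizes
below are arbitrary, and each worker gets its own copy of the forest, so
this trades memory for time.

    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.neighbors import LSHForest

    def query_chunk(forest, chunk):
        # Each worker handles one chunk of queries independently.
        return forest.kneighbors(chunk, n_neighbors=5)

    rng = np.random.RandomState(0)
    X = rng.randn(100000, 512)
    queries = rng.randn(2000, 512)

    lshf = LSHForest(n_estimators=10, n_candidates=100, random_state=0).fit(X)

    chunks = np.array_split(queries, 8)
    results = Parallel(n_jobs=4)(delayed(query_chunk)(lshf, c) for c in chunks)
    distances = np.vstack([d for d, _ in results])
    indices = np.vstack([i for _, i in results])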