On 04/16/2015 05:49 PM, Joel Nothman wrote:
I more or less agree. Certainly we only need to do one searchsorted per
query per tree, and then do linear scans. There is a question of how close
we stay to the original LSHForest algorithm, which relies on matching
prefixes rather than hamming distance. Hamming distance is easier to
calculate in ...
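
To make the searchsorted-per-tree idea concrete, here is a rough numpy
sketch of the kind of per-tree query being discussed. The function name,
the packed-uint64 hash representation and the candidate window are all
invented for illustration; this is not the actual LSHForest code.

    import numpy as np

    def query_one_tree(sorted_hashes, query_hash, n_candidates):
        """Toy per-tree query: one searchsorted, then a linear scan.

        sorted_hashes: sorted 1-D uint64 array of packed hash keys for one tree.
        query_hash:    the query's packed uint64 key.
        (Names and packing are invented for illustration.)
        """
        query_hash = np.uint64(query_hash)

        # One binary search per query per tree.
        pos = np.searchsorted(sorted_hashes, query_hash)

        # Linear scan around the insertion point to gather candidates.
        lo = max(pos - n_candidates, 0)
        hi = min(pos + n_candidates, len(sorted_hashes))
        candidates = sorted_hashes[lo:hi]

        # Hamming distance is easy to vectorize: XOR, then popcount per row.
        xor = candidates ^ query_hash
        hamming = np.unpackbits(xor.view(np.uint8)).reshape(len(xor), 64).sum(axis=1)

        # The original LSH Forest instead ranks by length of the matching
        # *prefix*, i.e. the number of leading bits shared with the query.
        prefix = np.zeros(len(candidates), dtype=np.intp)
        for shift in range(63, -1, -1):  # toy loop, not optimised
            prefix[(xor >> np.uint64(shift)) == 0] = 64 - shift
        return candidates, hamming, prefix

Either ranking criterion reuses the same searchsorted-plus-scan skeleton;
the difference is only in how the candidates are ordered afterwards.
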
Hi Joel,
To extend your analysis:
- when n_samples*n_indices is large enough, the bottleneck is the use of
the index, as you say.
- when n_dimensions*n_candidates is large enough, the bottleneck is the
computation of true distances between DB points and the query.
To serve well both kinds of use cases ...
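
To make the two regimes explicit, here is a back-of-the-envelope cost
model. The function name and the constants c_index/c_dist are made up and
would have to be measured; this only shows which term dominates where.

    import numpy as np

    def per_query_cost(n_samples, n_indices, n_dimensions, n_candidates,
                       c_index=1.0, c_dist=1.0):
        # Phase 1: use of the index -- one binary search plus a linear scan
        # in each of the n_indices sorted hash arrays. (In practice the scan
        # term also grows with n_samples through hash collisions.)
        index_cost = c_index * n_indices * (np.log2(n_samples) + n_candidates)
        # Phase 2: exact distances between the query and the candidate set.
        distance_cost = c_dist * n_candidates * n_dimensions
        return index_cost, distance_cost

    # Index-bound regime: many samples and trees, cheap distances.
    print(per_query_cost(n_samples=10**7, n_indices=50,
                         n_dimensions=30, n_candidates=100))
    # Distance-bound regime: high-dimensional points, many candidates.
    print(per_query_cost(n_samples=10**5, n_indices=10,
                         n_dimensions=5000, n_candidates=1000))
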
Oh. Silly mistake. Doesn't break with the correct patch, now at PR#4604...
On 16 April 2015 at 14:24, Joel Nothman wrote:
Except apparently that commit breaks the code... Maybe I've misunderstood
something :(
On 16 April 2015 at 14:18, Joel Nothman wrote:
ball tree is not vectorized in the sense of SIMD, but there is Python/numpy
overhead in LSHForest that is not present in ball tree.
I think one of the problems is the high n_candidates relative to the
n_neighbors. This really increases the search time.
Once we're dealing with large enough index a ...
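
For anyone who wants to see how much n_candidates alone matters on their
own machine, a quick-and-dirty comparison along these lines should do it.
This assumes the 0.16-era API where LSHForest lives in sklearn.neighbors;
the data sizes are arbitrary.

    import time
    import numpy as np
    from sklearn.neighbors import LSHForest, NearestNeighbors

    rng = np.random.RandomState(42)
    X = rng.randn(100000, 128)      # arbitrary sizes, just for timing
    queries = rng.randn(100, 128)

    for n_candidates in (10, 50, 200):
        lshf = LSHForest(n_estimators=10, n_candidates=n_candidates,
                         random_state=42).fit(X)
        t0 = time.time()
        lshf.kneighbors(queries, n_neighbors=5)
        print("LSHForest, n_candidates=%d: %.2fs"
              % (n_candidates, time.time() - t0))

    ball = NearestNeighbors(algorithm="ball_tree").fit(X)
    t0 = time.time()
    ball.kneighbors(queries, n_neighbors=5)
    print("ball_tree: %.2fs" % (time.time() - t0))
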
Moreover, this drawback occurs because LSHForest does not vectorize across
multiple queries the way 'ball_tree' and the other methods do. This
significantly slows down the exact neighbor distance calculation that
follows the approximation step. It will not be a problem if queries are for
individual points. Unfortunately, form ...
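
A sketch of the difference, with invented shapes and candidate sets rather
than the actual LSHForest internals: refining one query at a time pays the
Python/numpy call overhead once per query, while a batched computation pays
it once overall.

    import numpy as np
    from sklearn.metrics import pairwise_distances

    rng = np.random.RandomState(0)
    X = rng.randn(50000, 256)                       # database
    Q = rng.randn(200, 256)                         # queries
    cand = rng.randint(0, len(X), size=(200, 100))  # 100 candidate ids per query

    # Per-query refinement: 200 tiny pairwise_distances calls, so the
    # Python/numpy overhead is paid once per query.
    dists_loop = [pairwise_distances(Q[i:i + 1], X[cand[i]])[0]
                  for i in range(len(Q))]

    # Batched refinement: because every candidate set here has the same
    # shape, one broadcasted computation amortises that overhead.
    diffs = X[cand] - Q[:, np.newaxis, :]            # (200, 100, 256)
    dists_batch = np.sqrt((diffs ** 2).sum(axis=-1))  # (200, 100)
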
LSHForest is not intended for dimensions at which exact methods work well,
nor for tiny datasets. Try d>500, n_points>10, I don't remember the
switchover point.
The documentation should make this clear, but unfortunately I don't see
that it does.
On Apr 15, 2015 7:08 PM, "Joel Nothman" wrote:
I agree this is disappointing, and we need to work on making LSHForest
faster. Portions should probably be coded in Cython, for instance, as the
current implementation is a bit circuitous in order to work in numpy. PRs
are welcome.
LSHForest could use parallelism to be faster, but so can (and will ...
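
As a sketch only, not a plan for the implementation: coarse-grained
parallelism is already possible from the outside by chunking the queries
and fanning them out with joblib. The chunk count, n_jobs and data sizes
below are arbitrary, and each worker gets its own copy of the forest, so
this trades memory for time.

    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.neighbors import LSHForest

    def query_chunk(forest, chunk):
        # Each worker handles one chunk of queries independently.
        return forest.kneighbors(chunk, n_neighbors=5)

    rng = np.random.RandomState(0)
    X = rng.randn(100000, 512)
    queries = rng.randn(2000, 512)

    lshf = LSHForest(n_estimators=10, n_candidates=100, random_state=0).fit(X)

    chunks = np.array_split(queries, 8)
    results = Parallel(n_jobs=4)(delayed(query_chunk)(lshf, c) for c in chunks)
    distances = np.vstack([d for d, _ in results])
    indices = np.vstack([i for _, i in results])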