On 04/19/2015 08:18 AM, Joel Nothman wrote:

On 17 April 2015 at 13:52, Daniel Vainsencher
daniel.vainsenc...@gmail.com wrote:

On 04/16/2015 05:49 PM, Joel Nothman wrote:
I more or less agree. Certainly we only need to do one searchsorted per
query per tree, and then do linear scans. There is a question of how
close we stay to the ...
Hi Joel,

To extend your analysis:
- when n_samples*n_indices is large enough, the bottleneck is the use of
the index, as you say.
- when n_dimensions*n_candidates is large enough, the bottleneck is the
computation of true distances between DB points and the query.
To serve both kinds of use well ...
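The two regimes above can be compared directly with a toy cost model. This is purely illustrative: both terms and the example sizes are invented for this sketch, not measured from the scikit-learn implementation.

```python
# Hypothetical toy cost model for the two bottleneck regimes
# (illustrative units of work, not measurements):
#   - index use scales roughly with n_samples * n_indices
#   - exact re-ranking scales roughly with n_dimensions * n_candidates
def dominant_bottleneck(n_samples, n_indices, n_dimensions, n_candidates):
    index_work = n_samples * n_indices
    rerank_work = n_dimensions * n_candidates
    return "index" if index_work >= rerank_work else "true distances"

# Large database, few candidates: the index dominates.
print(dominant_bottleneck(10**7, 10, 64, 100))       # index
# Small database, high dimension, many candidates: re-ranking dominates.
print(dominant_bottleneck(10**4, 10, 1024, 1000))    # true distances
```

Any real tuning would of course need measured constants for the two terms, but the crossover structure is what motivates serving both regimes.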
I more or less agree. Certainly we only need to do one searchsorted per
query per tree, and then do linear scans. There is a question of how close
we stay to the original LSHForest algorithm, which relies on matching
prefixes rather than Hamming distance. Hamming distance is easier to
calculate in ...
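The scheme described here, one searchsorted per query per tree followed by a linear scan and candidate re-ranking, can be sketched in plain NumPy. This is a toy random-projection LSH forest with made-up sizes, not scikit-learn's implementation, and it ranks candidates by Hamming distance rather than by matched prefix length:

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_dims, n_bits, n_trees = 2000, 64, 16, 4
X = rng.standard_normal((n_points, n_dims))

# One set of random hyperplanes per tree; the sign of each projection
# gives one bit of the hash code.
planes = rng.standard_normal((n_trees, n_bits, n_dims))
bits = np.einsum('tbd,nd->tnb', planes, X) > 0          # (trees, points, bits)
weights = 1 << np.arange(n_bits)[::-1]                  # pack bits, MSB first
codes = bits.astype(np.int64) @ weights                 # (trees, points)
order = np.argsort(codes, axis=1)                       # build: sort once per tree
sorted_codes = np.take_along_axis(codes, order, axis=1)

def query(q, k=5, scan=32):
    """Approximate k-NN: hash q, gather nearby codes, re-rank exactly."""
    qbits = np.einsum('tbd,d->tb', planes, q) > 0
    qcodes = qbits.astype(np.int64) @ weights
    cands = set()
    for t in range(n_trees):
        # One searchsorted per query per tree, then a linear scan around it.
        pos = int(np.searchsorted(sorted_codes[t], qcodes[t]))
        lo, hi = max(0, pos - scan), min(n_points, pos + scan)
        cands.update(order[t, lo:hi].tolist())
    cands = np.fromiter(cands, dtype=int)
    # Rank candidates by Hamming distance across all trees' codes...
    ham = (bits[:, cands, :] != qbits[:, None, :]).sum(axis=(0, 2))
    shortlist = cands[np.argsort(ham)[:4 * k]]
    # ...then refine the shortlist with exact Euclidean distances.
    d = np.linalg.norm(X[shortlist] - q, axis=1)
    return shortlist[np.argsort(d)[:k]]
```

Querying with a database point returns that point first, since both its hash code and its exact distance are minimal; the accuracy/speed trade-off lives in `scan` and the shortlist size.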
Hi everyone,

I was really impressed by the speedups provided by LSHForest compared to
brute-force search. Out of curiosity, I compared LSHForest to the existing
ball tree implementation. The approximate algorithm is consistently slower
(see below). Is this normal, and should it be mentioned in ...
I agree this is disappointing, and we need to work on making LSHForest
faster. Portions should probably be coded in Cython, for instance, as the
current implementation is a bit circuitous in order to work in numpy. PRs
are welcome.
LSHForest could use parallelism to be faster, but so can (and ...
Oh. Silly mistake. Doesn't break with the correct patch, now at PR#4604...
On 16 April 2015 at 14:24, Joel Nothman joel.noth...@gmail.com wrote:
Except apparently that commit breaks the code... Maybe I've misunderstood
something :(
On 16 April 2015 at 14:18, Joel Nothman
LSHForest is not intended for dimensions at which exact methods work well,
nor for tiny datasets. Try d500, n_points10; I don't remember the
switchover point.

The documentation should make this clear, but unfortunately I don't see
that it does.
On Apr 15, 2015 7:08 PM, Joel Nothman
ball tree is not vectorized in the sense of SIMD, but there is Python/numpy
overhead in LSHForest that is not present in ball tree.
I think one of the problems is the high n_candidates relative to
n_neighbors. This really increases the search time.

Once we're dealing with a large enough index ...
Moreover, this drawback occurs because LSHForest does not vectorize
multiple queries as 'ball_tree' or the other methods do. This slows down
the exact neighbor distance calculation significantly after the
approximation step. It will not be a problem if queries are for individual
points. Unfortunately, ...
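The vectorization gap is easy to see in a small NumPy sketch (array sizes invented for illustration): computing exact distances one query at a time pays the Python/numpy dispatch overhead once per query, whereas a single broadcasted computation handles the whole batch in one call.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 32))   # database points
Q = rng.standard_normal((50, 32))     # a batch of queries

# Per-query loop: one distance computation (and its overhead) per query.
d_loop = np.stack([np.linalg.norm(X - q, axis=1) for q in Q])

# Vectorized over the whole batch: broadcast (50, 1, 32) against (2000, 32).
d_batch = np.linalg.norm(Q[:, None, :] - X[None, :, :], axis=2)

assert np.allclose(d_loop, d_batch)   # same distances, far fewer Python calls
```

The broadcasted form trades memory (a (50, 2000, 32) intermediate) for per-query overhead, which is exactly the trade-off at stake when queries arrive in batches rather than individually.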