I did a rough implementation, setting b = min_hash_match. The result I got
from running the benchmark is attached. It was able to roughly triple the
speed-up of the kneighbors function for large index sizes. However, this
implementation adds some overhead to fitting time, as there are 2**b *
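As a sketch of where that 2**b fitting overhead comes from: if each tree keeps its hashes in one sorted integer array, the offsets of all 2**b possible b-bit prefixes can be precomputed once at fit time, turning each query's bucket lookup into O(1) instead of a binary search. The helper names, the integer-hash representation, and the `hash_bits` parameter below are illustrative assumptions, not the actual LSHForest code.

```python
import numpy as np

def fit_prefix_offsets(sorted_hashes, b, hash_bits=32):
    # One binary search per possible b-bit prefix: this is the 2**b
    # per-tree cost paid once at fit time (hypothetical helper).
    shift = hash_bits - b
    boundaries = np.searchsorted(sorted_hashes, np.arange(2 ** b + 1) << shift)
    return boundaries

def query_prefix_range(boundaries, query_hash, b, hash_bits=32):
    # O(1) lookup of the slice of index entries sharing the query's
    # first b bits, replacing a per-query binary search.
    prefix = query_hash >> (hash_bits - b)
    return int(boundaries[prefix]), int(boundaries[prefix + 1])
```

The fit-time cost grows as 2**b per tree, which is the trade-off against faster queries mentioned above.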
It's nice to see some decent speed-up factors, though the accuracy tradeoff
is still not so great. Still, I'd like to see the code and where we can go
from here. Great work so far!
On 7 August 2015 at 07:50, Maheshakya Wijewardena pmaheshak...@gmail.com
wrote:
I did a rough implementation,
Thanks, I look forward to this being improved, while I have little
availability to help myself atm.
On 2 August 2015 at 22:58, Maheshakya Wijewardena pmaheshak...@gmail.com
wrote:
I agree with Joel. Profiling indicated that 69.8% of total time of
kneighbors is spent on _find_matching_indices
I agree with Joel. Profiling indicated that 69.8% of total time of
kneighbors is spent on _find_matching_indices and 22.9% is spent on
_compute_distances. So I'll give priority to work on _find_matching_indices
with the method you suggested.
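For reference, a per-function breakdown like the one quoted (69.8% in _find_matching_indices, 22.9% in _compute_distances) can be reproduced with the standard-library profiler; `profile_call` below is a hypothetical helper, not part of scikit-learn.

```python
import cProfile
import io
import pstats

def profile_call(func, *args, **kwargs):
    # Run func under cProfile and return (result, report); the report
    # lists cumulative time per function, the kind of breakdown that
    # shows which internals dominate a call such as kneighbors.
    prof = cProfile.Profile()
    prof.enable()
    result = func(*args, **kwargs)
    prof.disable()
    stream = io.StringIO()
    pstats.Stats(prof, stream=stream).sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()
```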
On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena
Hi Joel,
I was on vacation during the past 3 days. I'll look into this asap and let
you all know.
I also did some profiling, but only of the `pairwise_distances` call. The
brute force technique uses it directly for the entire query array, but LSH
uses it in a loop, and I noticed there is
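The cost difference between those two call patterns can be illustrated with a plain NumPy stand-in for the distance computation (the function names below are illustrative, not scikit-learn's):

```python
import numpy as np

def distances_batched(X, Q):
    # One vectorized call over the whole query array (what the brute
    # force path effectively does with a single pairwise call).
    return np.sqrt(((Q[:, None, :] - X[None, :, :]) ** 2).sum(-1))

def distances_looped(X, Q):
    # Per-query computation in a Python loop (the pattern observed in
    # the LSH path): identical output, but Python-level call overhead
    # is paid once per query instead of once overall.
    return np.vstack([np.sqrt(((q - X) ** 2).sum(-1)) for q in Q])
```

Both return the same (n_queries, n_samples) matrix; only the per-call overhead differs.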
@Maheshakya, will you be able to work on speeding up the ascending phase
instead in the near future? Or should another contributor take that baton?
Not only does it seem to be a major contributor to runtime, but it is
independent of metric and hashing mechanism (within binary hashes), and
hence
Hi,
This is absolutely great news. Thanks a lot. Please do open a WIP PR. We
(at INRIA) were planning to allocate time from someone this summer to
work on this. So you'll have someone reviewing / advising.
With regards to releasing the GIL, you need to use the `with nogil`
statement in Cython.
What makes you think this is the main bottleneck? While it is not an
insignificant consumer of time, I really doubt this is what's making
scikit-learn's LSH implementation severely underperform with respect to
other implementations.
We need to profile. In order to do that, we need some sensible
One approach to fixing the ascending phase would be to ensure that
_find_matching_indices is only searching over parts of the tree that have
not yet been explored, while currently it searches over the entire index at
each depth.
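To make that concrete: since a shorter prefix can only widen the matching range, each shallower depth only needs to search the parts of the sorted index outside the range already found. The sketch below assumes integer hashes kept in one sorted array per tree; `ascending_ranges` and its parameters are hypothetical, not the actual scikit-learn code.

```python
import numpy as np

def ascending_ranges(sorted_hashes, query_hash, max_depth, hash_bits=32):
    # For depths max_depth..1, return the (start, stop) slice of entries
    # sharing the query's first `depth` bits, never re-searching the
    # range already matched at a deeper level.
    n = len(sorted_hashes)
    start, stop = 0, n
    first = True
    ranges = []
    for depth in range(max_depth, 0, -1):
        shift = hash_bits - depth
        prefix = query_hash >> shift
        if first:
            # Deepest level: one search over the full index.
            start = int(np.searchsorted(sorted_hashes, prefix << shift))
            stop = int(np.searchsorted(sorted_hashes, (prefix + 1) << shift))
            first = False
        else:
            # A shorter prefix can only widen the match, so the new start
            # lies in the unexplored head [0, start) and the new stop in
            # the unexplored tail [stop, n).
            start = int(np.searchsorted(sorted_hashes[:start], prefix << shift))
            stop += int(np.searchsorted(sorted_hashes[stop:], (prefix + 1) << shift))
        ranges.append((start, stop))
    return ranges
```

Each level's binary search runs over a strictly shrinking unexplored region, instead of the entire index at every depth.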
My preferred, but more experimental, solution is to memoize where the
(sorry, I should have said the first b layers, not 2**b layers, producing a
memoization of 2**b offsets)
On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote:
One approach to fixing the ascending phase would be to ensure that
_find_matching_indices is only searching over parts of the
Hi,
I've started to look into the matter of improving performance of LSHForest.
As we have discussed some time before (in fact, quite a long time ago), the
main concern is to Cythonize distance calculations. Currently, this is done
by iterating over all the query vectors when the `kneighbors` method is