Re: [Scikit-learn-general] Ball tree - different metrics

2015-05-14 Thread Jacob Vanderplas
Run time will depend highly on your metric. If your metric takes twice as long as the euclidean metric for a single distance computation, then you can expect the query to take roughly twice as long (with some variance due to query efficiency). Regarding hacking the tree code to implement your metri

Re: [Scikit-learn-general] Ball tree - different metrics

2015-05-14 Thread nafise mehdipoor
As the ball tree class implementation of scikit-learn for Euclidean (for example) and for large amount of data has an acceptable time, I just want to get sure if I do so, the performance will get considerably better? If yes, so it's just worth that I give it a try. I need a time less that a minu

Re: [Scikit-learn-general] Ball tree - different metrics

2015-05-14 Thread Jacob Vanderplas
If you're computing neighbors of very high dimensional data, you might consider an approximate search instead, for example one based on Locality Sensitive Hashing. There are some papers out there about kernelized LSH which allow a wide range of kernels, and many metrics can be approximated with a j

Re: [Scikit-learn-general] Ball tree - different metrics

2015-05-14 Thread Jacob Vanderplas
Sorry – I should have better specified what I meant. A user-defined metric passed as a python function will always be slow, no matter how you define that python function. If you want to create a custom metric that works quickly within BallTree, you'll have to write it in cython *within* the sciki

Re: [Scikit-learn-general] Ball tree - different metrics

2015-05-14 Thread nafise mehdipoor
I just tried the one with compiling my metric with Cython and it still is too far away from what I need it to be (around 60 seconds)! Still for the previous code: using 'Euclidean' metric: 30 secmetric as a function: 43 minmetric as a python module: 27 mindo you have any other solution that I ca

Re: [Scikit-learn-general] Ball tree - different metrics

2015-05-14 Thread nafise mehdipoor
Thank you so much. It was an example and the real metric will be defined within ball-tree-valid-metric conditions. The best. On Thursday, May 14, 2015 5:47 PM, Jacob Vanderplas wrote: User-defined metrics will always be slow, because they rely on the Python layer for callbacks. Th

Re: [Scikit-learn-general] Ball tree - different metrics

2015-05-14 Thread Jacob Vanderplas
User-defined metrics will always be slow, because they rely on the Python layer for callbacks. The only way to improve it is to write your own metric in Cython and re-compile the ball tree/kd tree code with your new addition. Furthermore, you have to make sure your metric actually meets the definit

Re: [Scikit-learn-general] Ball tree - different metrics

2015-05-14 Thread nafise mehdipoor
Thank you all for the explanation.Here I am trying to implement the balltree based DBSCAN algorithm and I am using the balltree class. Finally, as I meant to use the user-defined metric, I tried it with the code below. The time for this case is awfulDo you have any advice to improve it?Thank you

Re: [Scikit-learn-general] Ball tree - different metrics

2015-05-14 Thread Michael Eickenberg
Using EuclideanDistance: this can actually explain a factor of 10 (probably dimensionality-dependent) In [56]: d = sklearn.neighbors.dist_metrics.EuclideanDistance() In [57]: %timeit d.pairwise(x, y) 100 loops, best of 3: 6.04 ms per loop In [58]: %timeit d.pairwise(x, y) 100 loops, best of 3:

Re: [Scikit-learn-general] Ball tree - different metrics

2015-05-14 Thread Michael Eickenberg
Oops, I overlooked that the comparison was against EuclideanDistance proper. But as Jake says, some optimization does happen in the l2 case and in the example I gave it explains a factor of 5-6. On Thu, May 14, 2015 at 3:44 PM, Michael Eickenberg < michael.eickenb...@gmail.com> wrote: > There may

Re: [Scikit-learn-general] Ball tree - different metrics

2015-05-14 Thread Michael Eickenberg
There may also be some implicit optimization going on. Without looking at the (non pure-python) source of the Minkowski metric, a few tests hint at the fact that p=2 and p=1 are probably treated differently from the rest, because some mathematical tricks can be applied that avoid calculating the f

Re: [Scikit-learn-general] Ball tree - different metrics

2015-05-14 Thread Jacob Vanderplas
Hi Nafise, In general, yes the execution time will depend on the metric, and in a way that can be very difficult to predict. One reason is that because the euclidean metric is the most common, that case is slightly optimized. But that won't lead to a factor of ten. The Ball Tree works by dividing