Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Gael Varoquaux
Hi, This is absolutely great news. Thanks a lot. Please do open a WIP PR. We (at INRIA) were planning to allocate time from someone this summer to work on this, so you'll have someone reviewing / advising. With regard to releasing the GIL, you need to use the 'with nogil' statement in Cython.
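A hedged sketch of what releasing the GIL in Cython might look like for a distance loop of this kind (the function and variable names are illustrative, not scikit-learn internals):

```cython
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise_dist(double[:, ::1] X, double[:, ::1] Y, double[:, ::1] out):
    cdef Py_ssize_t i, j, k
    cdef double acc, diff
    # No Python objects are touched inside this block, so the GIL
    # can be released; other threads may run concurrently.
    with nogil:
        for i in range(X.shape[0]):
            for j in range(Y.shape[0]):
                acc = 0.0
                for k in range(X.shape[1]):
                    diff = X[i, k] - Y[j, k]
                    acc += diff * diff
                out[i, j] = sqrt(acc)
```

Only typed memoryviews and C-level operations are allowed inside the `with nogil` block; any Python object access there is a compile error.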

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
What makes you think this is the main bottleneck? While it is not an insignificant consumer of time, I really doubt this is what's making scikit-learn's LSH implementation severely underperform with respect to other implementations. We need to profile. In order to do that, we need some sensible
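A minimal sketch of the kind of profiling Joel is asking for, using only the standard library (`candidate_bottleneck` is a stand-in, not actual LSHForest code):

```python
import cProfile
import io
import pstats

def candidate_bottleneck(n):
    # stand-in for e.g. LSHForest's candidate/distance computations
    return sum(i * i for i in range(n))

def profile_report():
    """Run the workload under cProfile and return a text report."""
    prof = cProfile.Profile()
    prof.enable()
    candidate_bottleneck(100000)
    prof.disable()
    buf = io.StringIO()
    # sort by cumulative time so the dominant call paths float to the top
    pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(5)
    return buf.getvalue()

report = profile_report()
```

The report shows where time actually goes, which is the evidence needed before deciding what to Cythonize.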

Re: [Scikit-learn-general] Possible code contribution (Poisson loss)

2015-07-30 Thread Jan Hendrik Metzen
That's true, I wasn't aware that score_samples is used already in the context of density estimation. score_samples would be okay then in my opinion. Jan On 29.07.2015 18:46, Andreas Mueller wrote: Hm, I'm not entirely sure how score_samples is currently used, but I think it is the

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
One approach to fixing the ascending phase would be to ensure that _find_matching_indices only searches over parts of the tree that have not yet been explored, whereas currently it searches over the entire index at each depth. My preferred, but more experimental, solution is to memoize where the


Re: [Scikit-learn-general] Possible code contribution (Poisson loss)

2015-07-30 Thread Mathieu Blondel
While the Gaussian distribution has a PDF, the Poisson distribution has a PMF. From Wikipedia (https://en.wikipedia.org/wiki/Probability_mass_function ): A probability mass function differs from a probability density function (pdf) in that the latter is associated with continuous rather than
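To make the distinction concrete, the Poisson PMF assigns probability mass to non-negative integers. A quick standard-library sketch:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson random variable with rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

# The mass sums to (numerically) 1 over a wide enough support:
total = sum(poisson_pmf(k, 3.5) for k in range(50))
```

Unlike a density, each `poisson_pmf(k, lam)` value is itself a probability, so no integration is involved.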

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
(sorry, I should have said the first b layers, not 2**b layers, producing a memoization of 2**b offsets) On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote: One approach to fixing the ascending phase would ensure that _find_matching_indices is only searching over parts of the
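A hedged sketch of the idea (illustrative, not scikit-learn's actual code): with hashes stored as sorted bit-strings, the items matching a given prefix form a contiguous slice found by binary search, and the slices for all 2**b prefixes of the first b bits can be precomputed once and reused during the ascending phase.

```python
from bisect import bisect_left, bisect_right
from itertools import product

def matching_slice(sorted_hashes, prefix):
    """Return (lo, hi) so that sorted_hashes[lo:hi] all start with prefix."""
    lo = bisect_left(sorted_hashes, prefix)
    # '\x7f' sorts after '0' and '1', so this bounds every possible suffix
    hi = bisect_right(sorted_hashes, prefix + "\x7f")
    return lo, hi

def memoize_offsets(sorted_hashes, b):
    """Precompute the slice for every bit-prefix of length b."""
    return {"".join(p): matching_slice(sorted_hashes, "".join(p))
            for p in product("01", repeat=b)}

hashes = sorted(["0001", "0010", "0011", "0110", "1010", "1011"])
offsets = memoize_offsets(hashes, 2)
```

With the offsets cached, deeper queries only need to search within the memoized slice instead of re-scanning the whole index at each depth.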

Re: [Scikit-learn-general] Possible code contribution (Poisson loss)

2015-07-30 Thread Mathieu Blondel
On Thu, Jul 30, 2015 at 11:38 PM, Andreas Mueller t3k...@gmail.com wrote: I am mostly concerned about API explosion. I take your point of PDF vs PMF. Maybe predict_proba(X, y) is better. Would you also support predict_proba(X, y) for classifiers (which would be

Re: [Scikit-learn-general] KMedoids algorithm in Scikit-Learn

2015-07-30 Thread Sebastian Raschka
Yes, it may be far more expensive than k-means. I just used it with Euclidean distance -- it was for a comparison. I think k-medoids can still be useful for smaller, maybe noisier datasets, or if you have some distance measure where calculating averages may not make sense. On Jul 30, 2015, at

Re: [Scikit-learn-general] Possible code contribution (Poisson loss)

2015-07-30 Thread Brian Scannell
I support the inclusion of Poisson loss, although a quick note on predict_prob_at: The output of Poisson regression is a posterior distribution over the rate parameter in the form of a Gamma distribution. If we assume no uncertainty at all in the prediction, the posterior predictive distribution
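A hedged sketch of Brian's point (all names illustrative): with a Gamma posterior over the Poisson rate, the posterior predictive is a negative binomial, and as the posterior concentrates it collapses to a plain Poisson.

```python
from math import exp, lgamma, log

def gamma_poisson_predictive(k, a, b):
    """P(K = k) after integrating out a Gamma(shape=a, rate=b) posterior
    over the Poisson rate: a negative binomial with r = a, p = b/(b+1)."""
    return exp(lgamma(k + a) - lgamma(k + 1) - lgamma(a)
               + a * log(b / (b + 1.0)) + k * log(1.0 / (b + 1.0)))

# A sharp posterior with mean 2.0 (a=2000, b=1000) nearly matches
# the point-estimate Poisson(2.0) pmf at k=3:
sharp = gamma_poisson_predictive(3, a=2000.0, b=1000.0)
point = exp(3 * log(2.0) - 2.0 - lgamma(4))  # Poisson(2.0) pmf at k=3
```

The extra variance of the negative binomial relative to the Poisson is exactly the contribution of the remaining uncertainty in the rate.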

Re: [Scikit-learn-general] KMedoids algorithm in Scikit-Learn

2015-07-30 Thread Sebastian Raschka
I was looking for k-medoids too a couple of weeks ago and ended up implementing it myself -- but more quick and dirty. I would really welcome a nice and efficient implementation available via scikit, for example, using Voronoi iteration. Best, Sebastian On Jul 30, 2015, at 1:51 PM, Timo
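A quick, hedged sketch of the Voronoi-iteration heuristic mentioned above (pure Python, names illustrative): assign each point to its nearest medoid, then move each medoid to the cluster member that minimizes the total in-cluster distance.

```python
def kmedoids(points, k, dist, n_iter=100):
    """Voronoi-iteration k-medoids with naive init (first k points)."""
    medoids = list(points[:k])
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest medoid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: dist(p, medoids[j]))
            clusters[j].append(p)
        # update step: new medoid minimizes total distance to its cluster
        new_medoids = [min(c, key=lambda m: sum(dist(m, q) for q in c))
                       if c else medoids[j]
                       for j, c in enumerate(clusters)]
        if new_medoids == medoids:
            break  # converged
        medoids = new_medoids
    return medoids

# Works with any distance; medoids are always actual data points,
# so no averaging is ever needed.
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
meds = kmedoids(pts, 2, manhattan)
```

Each iteration costs O(n * k) for assignment plus O(sum of squared cluster sizes) for the update, which is why it scales worse than k-means in n_samples.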

[Scikit-learn-general] KMedoids algorithm in Scikit-Learn

2015-07-30 Thread Timo Erkkilä
Hi all, I checked and could find no mention of KMedoids in Scikit-Learn. My friend and I have implemented the algorithm in Python and were wondering if it could be brought into Scikit-Learn. Thoughts? Cheers, Timo PS: I am new to the mailing list, so please guide me in case I am doing

Re: [Scikit-learn-general] KMedoids algorithm in Scikit-Learn

2015-07-30 Thread Andreas Mueller
I think KMedoids has come up before. One issue is that it doesn't really scale to large n_samples, right? There is an implementation mentioned here: https://github.com/scikit-learn/scikit-learn/issues/3799 Do you use it because you have a custom distance matrix? On 07/30/2015 02:27 PM,

Re: [Scikit-learn-general] Code contribution: Supervised PCA

2015-07-30 Thread Stylianos Kampakis
My feeling is that it will perform better in cases where there are clusters of correlated attributes, which is exactly the case where it would make sense to use a dimensionality reduction technique such as factor analysis or PCA. Hastie et al. in their book Elements of Statistical Learning

Re: [Scikit-learn-general] Code contribution: Supervised PCA

2015-07-30 Thread Stylianos Kampakis
Hi Sebastian, LDA is unsupervised. Supervised PCA finds components correlated with the response variable. Best regards, Stelios 2015-07-29 22:55 GMT+01:00 Sebastian Raschka se.rasc...@gmail.com: Out of curiosity, how does supervised PCA compare to LDA (Linear Discriminant Analysis); in a

Re: [Scikit-learn-general] Code contribution: Supervised PCA

2015-07-30 Thread Mathieu Blondel
He was asking about Linear Discriminant Analysis, not Latent Dirichlet Allocation. Mathieu On Thu, Jul 30, 2015 at 7:58 PM, Stylianos Kampakis stylianos.kampa...@gmail.com wrote: Hi Sebastian, LDA is unsupervised. Supervised PCA finds components correlated with the response variable.

[Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Maheshakya Wijewardena
Hi, I've started to look into the matter of improving the performance of LSHForest. As we have discussed some time before (in fact, quite a long time ago), the main concern is to Cythonize distance calculations. Currently, this is done by iteratively moving over all the query vectors when the `kneighbors` method is

Re: [Scikit-learn-general] Code contribution: Supervised PCA

2015-07-30 Thread Stylianos Kampakis
Sorry, my fault. Supervised PCA is different from Linear Discriminant Analysis. It uses a heuristic to keep only the variables that show some correlation with the response when calculating the components. It does not explicitly incorporate class separation as an objective. Supervised PCA can be
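The screening heuristic described above can be sketched in a few lines (illustrative, not scikit-learn code): keep only the features whose absolute correlation with the response clears a threshold, then fit ordinary PCA on what remains.

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def screen_features(X, y, threshold=0.3):
    """Indices of columns of X whose |correlation| with y >= threshold."""
    n_features = len(X[0])
    return [j for j in range(n_features)
            if abs(pearson([row[j] for row in X], y)) >= threshold]

# Column 0 tracks y almost linearly; column 1 is essentially noise:
X = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 6]]
y = [1.1, 2.0, 2.9, 4.2, 5.0]
kept = screen_features(X, y)
# Ordinary PCA would then be fit on the kept columns only.
```

The threshold is the heuristic's only tuning knob; the PCA step itself is unchanged, which is what distinguishes this from methods like LDA that build class separation into the objective.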