Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
Oh. Silly mistake. Doesn't break with the correct patch, now at PR#4604... On 16 April 2015 at 14:24, Joel Nothman wrote: > Except apparently that commit breaks the code... Maybe I've misunderstood > something :( > > On 16 April 2015 at 14:18, Joel Nothman wrote: > >> ball tree is not vectorize

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
Except apparently that commit breaks the code... Maybe I've misunderstood something :( On 16 April 2015 at 14:18, Joel Nothman wrote: > ball tree is not vectorized in the sense of SIMD, but there is > Python/numpy overhead in LSHForest that is not present in ball tree. > > I think one of the pro

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
ball tree is not vectorized in the sense of SIMD, but there is Python/numpy overhead in LSHForest that is not present in ball tree. I think one of the problems is the high n_candidates relative to the n_neighbors. This really increases the search time. Once we're dealing with large enough index a

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Maheshakya Wijewardena
Moreover, this drawback occurs because LSHForest does not vectorize multiple queries as in 'ball_tree' or any other method. This slows the exact neighbor distance calculation down significantly after approximation. This will not be a problem if queries are for individual points. Unfortunately, form
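The vectorization point above can be illustrated with a toy distance computation in plain NumPy (this is an illustrative sketch, not LSHForest's actual code; all variable names are made up): looping over queries pays the Python/numpy call overhead once per query, while a single broadcasted expression computes all query-to-index distances at once.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(1000, 64)      # indexed points
queries = rng.rand(50, 64)  # a batch of query points

# Per-query loop: one (1000,) distance vector at a time,
# paying Python/numpy dispatch overhead 50 times.
looped = np.array([np.sqrt(((X - q) ** 2).sum(axis=1)) for q in queries])

# Vectorized: all 50 x 1000 distances in one broadcasted expression.
diff = queries[:, None, :] - X[None, :, :]
batched = np.sqrt((diff ** 2).sum(axis=-1))
```

Both give identical results; the batched form is what 'ball_tree' style query paths effectively get for free, and what a per-point query loop gives up.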

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Daniel Vainsencher
LSHForest is not intended for dimensions at which exact methods work well, nor for tiny datasets. Try d>500, n_points>10, I don't remember the switchover point. The documentation should make this clear, but unfortunately I don't see that it does. On Apr 15, 2015 7:08 PM, "Joel Nothman" wrote:

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
I agree this is disappointing, and we need to work on making LSHForest faster. Portions should probably be coded in Cython, for instance, as the current implementation is a bit circuitous in order to work in numpy. PRs are welcome. LSHForest could use parallelism to be faster, but so can (and will

Re: [Scikit-learn-general] Robust PCA

2015-04-15 Thread Yogesh Karpate
A couple of months back, I tried to use the following: https://github.com/shriphani/robust_pcp/blob/master/robust_pcp.py But I could not install PyPROPACK, developed by Jake Vanderplas, so I used randomized_svd from scikit-learn instead of svdp in the code mentioned above. It worked "OK" for me. On Wed, Apr

Re: [Scikit-learn-general] Robust PCA

2015-04-15 Thread Kyle Kastner
If it were in scipy, would it be backported to the older versions? How would we handle that? On Wed, Apr 15, 2015 at 3:40 PM, Olivier Grisel wrote: > We could use PyPROPACK if it was contributed upstream in scipy ;) > > I know that some scipy maintainers don't appreciate arpack much and > would lik

Re: [Scikit-learn-general] Robust PCA

2015-04-15 Thread Olivier Grisel
We could use PyPROPACK if it were contributed upstream in scipy ;) I know that some scipy maintainers don't appreciate arpack much and would like to see it replaced (or at least complemented with PROPACK). -- Olivier

Re: [Scikit-learn-general] Scikit-learn-general Digest, Vol 63, Issue 34

2015-04-15 Thread Alex Papanicolaou
Kyle & Andreas, Here is my github repo: https://github.com/apapanico/RPCA Responses: 1. I didn't make the GSoC suggestion a few years ago (also, I'm not a student anymore :-(, just using RPCA for work); I just came across it in a Google search when trying to find python implementations. 2. As for GoDec, I

Re: [Scikit-learn-general] Robust PCA

2015-04-15 Thread Kyle Kastner
Did you look at GoDec at all? At least when I checked it was more scalable. My bad implementations translated from MATLAB are here: http://kastnerkyle.github.io/blog/2014/03/05/matrix-decomposition/ As far as PROPACK goes - what are the minimal methods we would need to port? I don't know that we w

Re: [Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn?

2015-04-15 Thread Andreas Mueller
Hi. Yes, run "make latexpdf" in the "doc" folder. Best, Andy On 04/15/2015 01:11 PM, Tim wrote: > Thanks, Andy! > > How do you generate the pdf file? Can I also do that? > > > On Wed, 4/15/15, Andreas Mueller wrote: > > Subject: Re: [Scikit-learn-g

Re: [Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn?

2015-04-15 Thread Tim
Thanks, Andy! How do you generate the pdf file? Can I also do that? On Wed, 4/15/15, Andreas Mueller wrote: Subject: Re: [Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn? To: scikit-learn-general@lists.sourcef

Re: [Scikit-learn-general] Robust PCA

2015-04-15 Thread Andreas Mueller
Hi Alex. Thanks for that :) It would be great if you could publish your version to github. We probably can't use PyPROPACK in scikit-learn. The GSoC application period is just over, so you'd have to wait till next year to do that. Cheers, Andy On 04/15/2015 12:53 PM, Alex wrote: Hi Andreas,

Re: [Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn?

2015-04-15 Thread Andreas Mueller
Hi Tim. There are pdfs for 0.16.0 and 0.16.1 up now at http://sourceforge.net/projects/scikit-learn/files/documentation/ Let us know if there are issues with it. Cheers, Andy On 04/15/2015 12:08 PM, Tim wrote: > Hello, > > I am looking for a pdf file for the documentation for the latest stable

Re: [Scikit-learn-general] Robust PCA

2015-04-15 Thread Alex
Hi Andreas, I have an implementation of the ALM method for Robust PCA from Candes using Jake Vanderplas' PyPROPACK. It's in a private bitbucket repo but I will move it to github and send the link if you like. I actually really wanted to contribute RPCA to sklearn. I don't know about a PR but
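For readers following the thread, the principal component pursuit formulation from the Candes paper can be sketched with a dense SVD standing in for PROPACK's partial SVD (this is a minimal, unoptimized ADMM sketch under that substitution; `rpca_pcp`, `soft_threshold`, and the default parameters are illustrative, not Alex's ALM code):

```python
import numpy as np

def soft_threshold(A, tau):
    # Elementwise shrinkage: the proximal operator of the l1 norm.
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def rpca_pcp(M, lam=None, max_iter=1000, tol=1e-7):
    """Split M into low-rank L plus sparse S by ADMM on the
    principal component pursuit objective ||L||_* + lam * ||S||_1
    subject to L + S = M. A dense np.linalg.svd stands in where
    PROPACK would provide a faster partial SVD."""
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))       # default from Candes et al.
    mu = m * n / (4.0 * np.abs(M).sum())     # common step-size heuristic
    norm_M = np.linalg.norm(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(max_iter):
        # Singular value thresholding gives the low-rank update.
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ (soft_threshold(s, 1.0 / mu)[:, None] * Vt)
        # Elementwise shrinkage gives the sparse update.
        S = soft_threshold(M - L + Y / mu, lam / mu)
        resid = M - L - S
        Y += mu * resid
        if np.linalg.norm(resid) < tol * norm_M:
            break
    return L, S

# Synthetic check: rank-2 signal plus a few large sparse corruptions.
rng = np.random.RandomState(0)
L0 = rng.randn(60, 2) @ rng.randn(2, 50)                       # low rank
S0 = 10.0 * rng.binomial(1, 0.05, (60, 50)) * rng.randn(60, 50)  # sparse
M = L0 + S0
L_hat, S_hat = rpca_pcp(M)
```

With random incoherent factors and ~5% corruption, `L_hat` recovers `L0` closely; swapping the dense SVD for randomized_svd, as above, is the natural scalability fix.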

Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-04-15 Thread Satrajit Ghosh
hi andy and dan, i've been using a similar heuristic with extra trees quite effectively. i have to look at the details of this R package and the paper, but in my case i add a feature that has very low correlation with my target class/value (depending on the problem) and choose features that have a

Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-04-15 Thread Daniel Homola
Hi Andy, So at each iteration the X predictor matrix (n by m) is practically copied and each column is shuffled in the copied version. This shuffled matrix is then copied next to the original (n by 2m) and fed into the RF, to get the feature importances. Also at the start of the method, a vect
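The copy-and-shuffle step described above (the "shadow features" of the Boruta paper) can be sketched in a few lines of NumPy; the RF importance comparison itself is omitted, and the variable names here are illustrative, not from Daniel's implementation:

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.rand(100, 6)  # n=100 samples, m=6 predictors

# Copy X and independently permute each column of the copy, so every
# shadow feature keeps its marginal distribution but loses any
# association with the target.
shadow = X.copy()
for j in range(shadow.shape[1]):
    rng.shuffle(shadow[:, j])

# Concatenate originals and their shuffled shadows: n by 2m,
# ready to be fed to the random forest for importance scoring.
X_boruta = np.hstack([X, shadow])
```

Real features whose RF importance fails to beat the best shadow importance are then candidates for rejection.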

[Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn?

2015-04-15 Thread Tim
Hello, I am looking for a pdf file for the documentation for the latest stable scikit-learn i.e. 0.16.1. I followed http://scikit-learn.org/stable/support.html#documentation-resources, which leads me to http://sourceforge.net/projects/scikit-learn/files/documentation/, But the pdf files are f

Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-04-15 Thread Andreas Mueller
Hi Dan. I saw that paper, but it is not well-cited. My question is more about how different this is from what we already have. So it looks like some (5) random control features are added and the feature importances are compared against the control. The question is whether the feature importance that

Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-04-15 Thread Daniel Homola
Hi Andy, This is the paper: http://www.jstatsoft.org/v36/i11/ which was cited 79 times according to Google Scholar. Regarding your second point, the first 3 questions of the FAQ on the Boruta website answer it, I guess: https://m2.icm.edu.pl/boruta/ 1. *So, what's so special about Boruta?*

Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-04-15 Thread Andreas Mueller
Hi Daniel. That sounds potentially interesting. Is there a widely cited paper for this? I didn't read the paper, but it looks very similar to RFE(RandomForestClassifier()). Is it qualitatively different from that? Does it use a different feature importance? btw: your mail is flagged as spam as

[Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Miroslav Batchkarov
Hi everyone, I was really impressed by the speedups provided by LSHForest compared to brute-force search. Out of curiosity, I compared LSHForest to the existing ball tree implementation. The approximate algorithm is consistently slower (see below). Is this normal and should it be mentioned in the

Re: [Scikit-learn-general] Robust PCA

2015-04-15 Thread Kyle Kastner
Robust PCA is awesome - I would definitely like to see a good and fast version. I had a version once upon a time, but it was neither good *nor* fast :) On Wed, Apr 15, 2015 at 10:33 AM, Andreas Mueller wrote: > Hey all. > Was there some plan to add Robust PCA at some point? I vaguely remember > a

[Scikit-learn-general] Robust PCA

2015-04-15 Thread Andreas Mueller
Hey all. Was there some plan to add Robust PCA at some point? I vaguely remember a PR, but maybe I'm making things up. It sounds like a pretty cool model and is widely used: http://statweb.stanford.edu/~candes/papers/RobustPCA.pdf [and I was just promised a good implementation] Andy

Re: [Scikit-learn-general] pydata

2015-04-15 Thread Andreas Mueller
PyData London is soon; not sure the date is official. It's end of June, I think. In NYC, I think I'm talking at a Python meetup on April 23rd. On 04/14/2015 06:05 PM, Pagliari, Roberto wrote: Is there a pydata or sklearn workshop coming up in NYC or London? Thank you, --

[Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-04-15 Thread Daniel Homola
Hi all, I needed a multivariate feature selection method for my work. As I'm working with biological/medical data, where n < p or even n << p I started to read up on Random Forest based methods, as in my limited understanding RF copes pretty well with this suboptimal situation. I came across