Re: [Scikit-learn-general] Random forest benchmarks: wise.io vs. sklearn

2012-11-28 Thread Gilles Louppe
Do they use the same value for the min_samples_split parameter? I see they use a default value (hidden in their constructor, I guess), but theirs might not be the same as ours. Gilles On 28 November 2012 16:29, Andreas Mueller amuel...@ais.uni-bonn.de wrote: On 28.11.2012 16:19, Peter wrote

Re: [Scikit-learn-general] Random forest benchmarks: wise.io vs. sklearn

2012-11-28 Thread Gilles Louppe
Nope, they don't... On 28 November 2012 16:39, Andreas Mueller amuel...@ais.uni-bonn.de wrote: On 28.11.2012 16:33, Gilles Louppe wrote: Do they use the same value for the min_samples_split parameter? I see they use a default value (hidden in their constructor I guess), but theirs might

Re: [Scikit-learn-general] Problem unpickling 0.11 RF model in 0.12/0.13

2012-11-20 Thread Gilles Louppe
Thanks a lot for the quick responses and the suggestions. Unfortunately, rebuilding the model every time a new version comes out is not an option for me. Well then, from a very practical point of view, do you need to upgrade at all? Your model won't be any more accurate just because you upgrade.

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Gilles Louppe
For Trees, you could subsample and train trees on different subsets but not sure how well this works if the subsets are only a small fraction of the whole dataset. This often works surprisingly well :) (both along examples and features)
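The subsampling strategy discussed above (training each tree on a small subset of both examples and features) can be sketched with BaggingClassifier, which modern scikit-learn added after this thread; the dataset and fractions here are purely illustrative:

```python
# Illustrative sketch (not from the thread): bagging trees over random
# subsets of examples and features, as discussed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=20,
    max_samples=0.1,    # each tree sees only 10% of the examples
    max_features=0.5,   # ... and half of the features
    random_state=0,
).fit(X, y)
print(bag.score(X, y))
```

Even with each tree seeing only a small fraction of the data, the averaged ensemble often recovers most of the accuracy, which is the "works surprisingly well" effect mentioned above.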

Re: [Scikit-learn-general] RF optimisation - class weights etc.

2012-11-06 Thread Gilles Louppe
Hi Paul, a) Scaling has no effect on decision trees. b) You shouldn't set max_depth=5. Instead, build fully developed trees (max_depth=None), or rather tune min_samples_split using cross-validation. Hope this helps. Gilles On 6 November 2012 16:21, paul.czodrow...@merckgroup.com wrote: ear
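The tuning advice above can be sketched as follows; this uses the modern sklearn.model_selection API rather than the 2012-era module paths, and the dataset and parameter grid are illustrative:

```python
# Illustrative sketch: tuning min_samples_split by cross-validation
# on fully developed trees (max_depth=None), as advised above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, max_depth=None, random_state=0),
    param_grid={"min_samples_split": [2, 5, 10, 20]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```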

Re: [Scikit-learn-general] random forest question

2012-10-27 Thread Gilles Louppe
Hi, I know the speaker at pydata today claimed that the features are partitioned, Can you elaborate? If you pick your features prior to the construction of the tree and then build it on that subset only, then indeed, this is not random forest. That algorithm is called Random Subspaces. Best,

Re: [Scikit-learn-general] User defined classifier in Ensemble Learning

2012-10-16 Thread Gilles Louppe
Hi Siddhant, This is not yet supported unfortunately. Best, Gilles On 15 October 2012 17:50, Siddhant Goel siddhantg...@gmail.com wrote: Hi people, Does scikit-learn support plugging in user defined classifiers in its ensemble learning framework? I went through the documentation but could

[Scikit-learn-general] Teaching materials

2012-10-01 Thread Gilles Louppe
Hi Team, Given the increasing maturity of the project, we have decided (or, more precisely, I convinced my advisor :-)) to use Scikit-Learn in the machine learning course given at my university. Our objective is to make our students use Scikit-Learn for three assignments. We were previously using

Re: [Scikit-learn-general] Classifying where some labels are not in dataset

2012-09-26 Thread Gilles Louppe
Hi, The ensemble classes handle the problem you describe already. Have a look at the implementation of predict_proba of BaseForestClassifier in ensemble.py if you want to do that yourself by hand. Hope this helps. Gilles On Wednesday, 26 September 2012, Mathieu Blondel math...@mblondel.org

Re: [Scikit-learn-general] Classifying where some labels are not in dataset

2012-09-26 Thread Gilles Louppe
@Doug: Sorry I was typing my previous response from my phone. The snippet of code that I was talking about can be found at: https://github.com/glouppe/scikit-learn/blob/master/sklearn/ensemble/forest.py#L93 Cheers, Gilles On Wednesday, 26 September 2012, Gilles Louppe g.lou...@gmail.com wrote
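The idea behind the forest code referenced above can be sketched as follows: align each estimator's predict_proba columns onto a common, sorted class set before averaging, so estimators trained without some labels still contribute correctly. The helper name is illustrative, not scikit-learn API:

```python
# Illustrative sketch of averaging probabilities across estimators that
# may each have seen only a subset of the labels. Assumes all_classes
# is sorted and contains every estimator's classes_.
import numpy as np

def average_proba(estimators, all_classes, X):
    """Average predict_proba outputs, padding unseen classes with zeros."""
    all_classes = np.asarray(all_classes)
    avg = np.zeros((X.shape[0], all_classes.shape[0]))
    for est in estimators:
        proba = est.predict_proba(X)
        # map this estimator's class columns to positions in all_classes
        cols = np.searchsorted(all_classes, est.classes_)
        avg[:, cols] += proba
    return avg / len(estimators)
```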

Re: [Scikit-learn-general] Classifying where some labels are not in dataset

2012-09-26 Thread Gilles Louppe
I'm basically looking to take pre-trained classifiers and allows you to combine the predicted probabilities in custom ways, like favoring some classifiers over others, etc. Not that RandomForests™ are not useful--they could be the building block classifiers in such a system. @Oliver's

Re: [Scikit-learn-general] link between a classifier's score and fit methods

2012-09-21 Thread Gilles Louppe
Hi Christian, The score method does not play any role in fit. Are you sure the RF classifier is the same in both cases? (Have you set the random state to the same value?) Can you provide some code in any case? Thanks, Gilles On 21 September 2012 20:45, Christian Jauvin cjau...@gmail.com

Re: [Scikit-learn-general] Issue tags on github

2012-09-02 Thread Gilles Louppe
+1 On 2 September 2012 14:16, Alexandre Gramfort alexandre.gramf...@inria.fr wrote: sounds good to me especially since you volunteer to do it :) Alex On Sun, Sep 2, 2012 at 2:10 PM, Andreas Mueller amuel...@ais.uni-bonn.de wrote: Hey everybody. I noticed in the last couple of months that

Re: [Scikit-learn-general] wiserf vs. sklearn RF

2012-08-27 Thread Gilles Louppe
Hi Peter, At least we are better than Weka! More seriously, this indeed shows that there is still a lot of work to do... :( Gilles On 27 August 2012 09:06, Peter Prettenhofer peter.prettenho...@gmail.com wrote: Hi folks, I just stumbled upon this benchmark comparing wiserf, R randomForest,

Re: [Scikit-learn-general] Release 0.12 schedule

2012-08-01 Thread Gilles Louppe
Hi, I am indeed leaving for holiday very soon and will be disconnected until mid-August. My personal wish list is short: - #986: A full lazy argsort implementation of the tree construction algorithm. - #941: Tree post-pruning I plan to work on both upon my return. #941 shouldn't take much time,

Re: [Scikit-learn-general] Unable to call fit() on random forest classifier when it is encapsulated in separate class

2012-07-19 Thread Gilles Louppe
Hi, What version of scikit-learn are you using? 0.11 or dev? Best, Gilles On 19 July 2012 06:34, Shankar Satish mailsh...@yahoo.co.in wrote: Hello everyone, I have a custom prediction class which in fact consists of a random forest regressor+classifier. The class implements a fit() method,

Re: [Scikit-learn-general] euro scipy

2012-03-30 Thread Gilles Louppe
Since it's in Brussels, I think I should be there as well :) I can also help with something around scikit-learn if needed. Gilles On 30 March 2012 10:31, Vincent Michel vm.mic...@gmail.com wrote: I think that I will be there too. 2012/3/30 Alexandre Gramfort alexandre.gramf...@inria.fr I

Re: [Scikit-learn-general] covertype benchmark and unexpected extra trees and random forest results

2012-03-27 Thread Gilles Louppe
Hi, I am running the tests again, but indeed I think the difference in the results comes from the fact that max_features=sqrt(n_features) now by default, whereas it was max_features=n_features before. Gilles On 27 March 2012 11:56, Paolo Losi paolo.l...@gmail.com wrote: Thanks Peter, On Tue,

Re: [Scikit-learn-general] Default value n_estimators for GBRT

2012-03-25 Thread Gilles Louppe
Hi Olivier, The higher the number of estimators, the better. The more random the trees (e.g., the lower max_features), the more important it usually is to have a large forest to decrease the variance. To me, 10 is actually a very low default value. In my daily research, I deal with hundreds of
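The point above (more estimators never hurt, and 10 is a very low default) can be illustrated by comparing cross-validated scores at different forest sizes; the dataset and sizes here are purely illustrative:

```python
# Illustrative sketch: comparing a small vs. a larger forest.
# A larger forest reduces the variance of the ensemble's predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

for n in (10, 100):
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=n, random_state=0), X, y, cv=5)
    print(n, scores.mean())
```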

Re: [Scikit-learn-general] extra trees

2012-03-25 Thread Gilles Louppe
Hi Satrajit, Adding more trees should never hurt accuracy. The more, the better. Since you have a lot of irrelevant features, I'd advise increasing max_features in order to capture the relevant features when computing the random splits. Otherwise, your trees will indeed fit on noise. Best,

Re: [Scikit-learn-general] GridSearch

2012-02-03 Thread Gilles Louppe
Hi, You can inject your fit params using the `fit_params` parameter in GridSearchCV. Gilles On 3 February 2012 13:59, Mathias Verbeke mathi...@gmail.com wrote: Hi Andreas, You would have to add it to the fit method of SVC, not GridSearchCV. How can this be done in the digits example,
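In current scikit-learn the same effect is achieved by passing fit parameters (such as sample_weight) directly to GridSearchCV.fit, which forwards them to the estimator's fit on each fold; the `fit_params` constructor argument mentioned above was the older spelling. A sketch with illustrative data:

```python
# Illustrative sketch: forwarding a fit parameter (sample_weight)
# through a grid search on the digits example mentioned in the thread.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target
weights = np.ones(len(y))  # dummy weights for illustration

search = GridSearchCV(SVC(), {"C": [0.1, 1.0]}, cv=3)
search.fit(X, y, sample_weight=weights)  # forwarded to SVC.fit per fold
print(search.best_params_)
```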

Re: [Scikit-learn-general] Ensemble meta-estimators

2012-01-20 Thread Gilles Louppe
Yes indeed, as I said at the time, much of the forest code could be reused to implement a pure averaging meta-estimator. The main thing that makes BaseForest tree-specific is that it precomputes X_argsorted, such that it is computed only once for all trees, and injects it into the fit method of the

Re: [Scikit-learn-general] Ensemble meta-estimators

2012-01-20 Thread Gilles Louppe
Yep, I think that your solution would work, Olivier. I am busy this weekend, but I can push a first draft of this refactoring by the beginning of next week. Gilles On Saturday, 21 January 2012, Olivier Grisel olivier.gri...@ensta.org wrote: 2012/1/20 Andreas amuel...@ais.uni-bonn.de: On

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-10 Thread Gilles Louppe
It is converted to Fortran order for efficiency reasons. The most frequent and expensive operation is the search for split thresholds, which is performed column-wise, hence the Fortran ordering. Gilles On 10 January 2012 09:39, Andreas amuel...@ais.uni-bonn.de wrote: Hey everybody. Looking a

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-10 Thread Gilles Louppe
Well, not everyone is using modern architectures ;) On 10 January 2012 10:43, Andreas amuel...@ais.uni-bonn.de wrote: On 01/10/2012 10:22 AM, Gilles Louppe wrote: @both: This might be a stupid question but is there really so much difference in indexing continuously or with stride over a C

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-10 Thread Gilles Louppe
The current code works great for me (thanks for contributing), still it would mean a lot if I could make it even faster. At the moment it takes me about 8 hours to grow a tree with only a subset of the features that I actually want to use. I have a 128 core cluster here but then

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Gilles Louppe
Hi Andras, Try setting min_split=10 or higher. With a dataset of that size, there is no point in using min_split=1, you will 1) consume indeed too much memory and 2) overfit. Gilles PS: I have just started to change to doc. Expect a PR later today :) On 3 January 2012 09:27, Andreas

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-03 Thread Gilles Louppe
. Thanks! Will try that. Also thanks for working on the docs! :) Cheers, Andy On 01/03/2012 09:30 AM, Gilles Louppe wrote: Hi Andras, Try setting min_split=10 or higher. With a dataset of that size, there is no point in using min_split=1, you will 1) consume indeed too much memory and 2

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-02 Thread Gilles Louppe
Hi Andy! 1) The narrative docs say that max_features=n_features is a good value for RandomForests. As far as I know, Breiman 2001 suggests max_features = log_2(n_features). I also saw a claim that Breiman 2001 suggests max_features = sqrt(n_features) but I couldn't find that in the paper.

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-02 Thread Gilles Louppe
The narrative docs say that max_features=n_features is a good value for RandomForests. As far as I know, Breiman 2001 suggests max_features = log_2(n_features). I also saw a claim that Breiman 2001 suggests max_features = sqrt(n_features) but I couldn't find that in the paper. I just

[Scikit-learn-general] Parallel forest: call for review

2011-12-30 Thread Gilles Louppe
Hi list, This is a call to get an additional person (or more) to review the pending PR #491 on parallel forests of trees. It has already been reviewed by @ogrisel and looks ready to merge to both of us, but an additional review would be more than welcome!

Re: [Scikit-learn-general] Plotting training to evaluate the bias / variance regime

2011-12-30 Thread Gilles Louppe
It seems to be an interesting tool to me. We need to find a non-trivial overfitting example that would run in an acceptable time with the datasets available in the scikit. Actually, those curves can be plotted with respect to any parameter, not only the training set size. What comes to me is to
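The curves discussed above (train vs. validation score as a function of training set size, to visualise the bias/variance regime) later landed in scikit-learn itself; a sketch with an illustrative dataset:

```python
# Illustrative sketch: learning curves with a single decision tree,
# using the learning_curve helper that modern scikit-learn provides.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
sizes, train_scores, valid_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3,
)
# A large gap between the two mean curves signals the overfitting regime.
print(train_scores.mean(axis=1) - valid_scores.mean(axis=1))
```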

Re: [Scikit-learn-general] Pull requests: WIP/MRG convention

2011-12-18 Thread Gilles Louppe
I suggest that we use the following conventions: * PRs that are not ready to be merged should be named 'WIP: ...' (for 'Work In Progress') * PRs that are ready to be merged, or more accurately, for which the contributors feel that they are ready to be merged, should be renamed to

Re: [Scikit-learn-general] Sprint planning

2011-12-12 Thread Gilles Louppe
Hi list, During the sprint, I plan to review @pprett pull request on Gradient Tree Boosting. It is also my intention to implement parallel construction and prediction of forest of trees. I also have some ideas concerning the tree module, like computing variable importance (which is already

Re: [Scikit-learn-general] December sprint planning (NIPS edition)

2011-12-12 Thread Gilles Louppe
Hi Gael, Actually, it would be great if everybody who showed interest in the past, or who is now interested, could send me an email so that I have a clear view of who is coming when, to make the bookings. Due to my change of plans, I will arrive at NIPS on Thursday 15. I will come at the

[Scikit-learn-general] Buildbot out of service?

2011-11-24 Thread Gilles Louppe
Hi list, Just to let the admins know: for a few days now I have been trying to access our buildbot web page (http://buildbot.afpy.org/scikit-learn/), but the service always seems to be unavailable. Gilles -- All the

Re: [Scikit-learn-general] Buildbot out of service?

2011-11-24 Thread Gilles Louppe
Good job Nelle! Thank you :) Gilles On 24 November 2011 21:17, Olivier Grisel olivier.gri...@ensta.org wrote: 2011/11/24 Nelle Varoquaux nelle.varoqu...@gmail.com: The buildbot is back online ! Thanks! -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel

[Scikit-learn-general] Forests of randomized trees pull request

2011-11-18 Thread Gilles Louppe
Hi list, I would like to ask for comments on the forests of randomized trees pull request that I have been working on for the past few weeks. I think it is ready for merge. This pull request is the first in scikit-learn to concern ensemble methods and includes two important tree-based algorithms

Re: [Scikit-learn-general] Can't find sklearn.manifold in the Python Module Index

2011-11-10 Thread Gilles Louppe
Upgrading sphinx seems to solve the problem of missing docstring reference for functions, it should be already in the webpage. Great! Gilles -- RSA(R) Conference 2012 Save $700 by Nov 18 Register now

Re: [Scikit-learn-general] December sprint planning (NIPS edition)

2011-11-07 Thread Gilles Louppe
Booking a guest house is a great idea! But do you intend to book such a house for NIPS and the sprint, or for the sprint only? In particular, I was concerned about commuting to the Sierra Nevada during the workshops if the house was in Granada. Gilles On 5 November 2011 19:05, Olivier Grisel

Re: [Scikit-learn-general] December sprint planning (NIPS edition)

2011-11-07 Thread Gilles Louppe
What I have in mind is to have the house for NIPS and for the sprint, but to have a gap in between during the workshop. We are going to call them today, so if you want in for one or both of the periods, please keep us posted. I am in, for both periods! Thanks Gilles

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Gilles Louppe
I have just submitted a PR to Brian's branch :) On 4 November 2011 11:13, Peter Prettenhofer peter.prettenho...@gmail.com wrote: Gilles, I was not aware of your work in _tree.pyx. Looks great! Still, I didn't touch any line in `find_best_split` so the merging/rebase should be quite

Re: [Scikit-learn-general] SVM multi-class classification weights

2011-11-04 Thread Gilles Louppe
ranks = np.argsort(np.sum(estimator.coef_ ** 2, axis=0)) My question is: Why is the summation of the squared weight matrix used? What is the logic behind it? This is used for handling estimators that assign several weights to the same feature. Indeed, if several weights are assigned to each
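The quoted ranking line can be sketched end-to-end: in a multi-class linear model, coef_ holds one row of weights per class, so summing the squared weights column-wise collapses them into a single score per feature. An illustrative example on iris:

```python
# Illustrative sketch of the feature-ranking line quoted above,
# applied to a multi-class LinearSVC on the iris dataset.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# coef_ has shape (n_classes, n_features); summing the squares over
# axis=0 aggregates the per-class weights into one score per feature.
scores = np.sum(clf.coef_ ** 2, axis=0)
ranks = np.argsort(scores)  # features ordered from least to most important
print(ranks)
```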

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Gilles Louppe
    import cPickle
    from datetime import datetime

    for i in range(0, 20):
        with open("forest%d.pkl" % i, 'rb') as f:
            start = datetime.now()
            a = cPickle.load(f)
            print 'loaded', i, datetime.now() - start

produce these run-time results: loaded  0 0:00:14.952436 loaded  1
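One likely reason for those load times is that cPickle serialises the large NumPy arrays inside a fitted forest inefficiently; joblib, which scikit-learn recommends for model persistence, handles them far better. A sketch with an illustrative path and model:

```python
# Illustrative sketch: persisting a fitted forest with joblib instead
# of cPickle. The file path here is a temporary, illustrative one.
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "forest.pkl")
joblib.dump(forest, path)       # arrays are stored efficiently
restored = joblib.load(path)
```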
