Re: [Scikit-learn-general] Speed up Random Forest/ Extra Trees tuning

2016-03-22 Thread Gilles Louppe
Unfortunately, the most important parameters to adjust to maximize accuracy are often those controlling the randomness in the algorithm, i.e., max_features, for which this strategy is not possible. That being said, in the case of boosting, I think this strategy would be worth automating, e.g. to

Re: [Scikit-learn-general] Scikit Learn, Tree, new criterion

2016-03-16 Thread Gilles Louppe
Hi Eskil, (CC: the scikit-learn mailing list) Unfortunately, I would not have time myself to implement this new criterion. In any case, given the recent publication of this paper, I don't think we would add it to the scikit-learn codebase. Our policy is to only include time-tested algorithms.

Re: [Scikit-learn-general] Gaussian process predict method issue

2016-01-05 Thread Gilles Louppe
Hi, Before going further, what version of scikit-learn are you using? We did a major update of the GP code in 0.18-dev. Best, Gilles On 6 January 2016 at 05:01, Zafer Leylek wrote: > Just going over the scikit GaussianProcess code and comparing the results >

Re: [Scikit-learn-general] Jeff Levesque: '.predict_proba()' method for smaller datasets

2015-12-06 Thread Gilles Louppe
Hi Jeff, In general, most implementations of predict_proba are some proxy for the conditional probability p(y|x). Some of them really model this quantity quite well (e.g., Gaussian processes) while for some others it is closer to a heuristic than to the actual p(y|x) (e.g., with linear models).

Re: [Scikit-learn-general] RandomForestRegressor max_features default

2015-11-13 Thread Gilles Louppe
Hi Sebastian, Yes. This is intentional. The motivation comes from http://link.springer.com/article/10.1007/s10994-006-6226-1#/page-1 where it is shown experimentally that it is a good default value on average. Gilles On 13 November 2015 at 11:17, Sebastian Raschka wrote: >

Re: [Scikit-learn-general] scikit-learn 0.17b1 is out!

2015-10-17 Thread Gilles Louppe
Congratulations! I wish I could be there next week to offer you beers :( Gilles On 17 October 2015 at 18:51, Gael Varoquaux wrote: > Thanks a lot to the team that pulled out this beta release. I know that > it was a lot of work with a huge amount of bug fixing.

Re: [Scikit-learn-general] new committers

2015-09-23 Thread Gilles Louppe
Welcome to both of you Tom and Jan! On 23 September 2015 at 07:45, Jan Hendrik Metzen wrote: > Hi everyone, > thanks a lot; I am glad to be part of such a great team and looking > forward to continue to work with you guys! > > Cheers, > Jan > > On 22.09.2015 19:16,

Re: [Scikit-learn-general] Preparing the 0.17 release

2015-09-21 Thread Gilles Louppe
Hi Olivier, It seems the 3 PRs you mentioned are now closed/merged. Are there other blocking PRs you need us to look at before freezing for the release? Cheers, Gilles On 4 September 2015 at 12:16, Olivier Grisel wrote: > Hi all, > > It's been a while since we have

Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Gilles Louppe
Hi, > But the question is how to make the scikit-learn code, decisionTree Regressor > for example, running in distributed computing mode, to benefit the power of > Spark? I am sorry but you can't. The tree implementation in scikit-learn was not designed for this use case. Maybe you should have

Re: [Scikit-learn-general] DecisionTree: How to split categorical features into two subsets instead of a single value and the rest?

2015-09-12 Thread Gilles Louppe
Hi Rex, This is currently not supported in scikit-learn. Gilles On 12 September 2015 at 05:02, Rex X wrote: > Given categorical attributes, for instance > city = ['a', 'b', 'c', 'd', 'e', 'f'] > > With DictVectorizer(), we can transform "city" into a sparse matrix, using >

Re: [Scikit-learn-general] Is there any attribute saying the number of samples of each class in one decision tree node?

2015-08-31 Thread Gilles Louppe
Here is some sample code on how to retrieve the nodes traversed by a given sample:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import load_iris

    iris = load_iris()
    X, y = iris.data, iris.target
    clf = DecisionTreeClassifier().fit(X, y)

    def path(tree, sample):
        nodes = []
        node = 0
        while tree.children_left[node] != -1:  # -1 marks a leaf
            nodes.append(node)
            if sample[tree.feature[node]] <= tree.threshold[node]:
                node = tree.children_left[node]
            else:
                node = tree.children_right[node]
        nodes.append(node)
        return nodes

    print(path(clf.tree_, X[0]))

Re: [Scikit-learn-general] Is there any attribute saying the number of samples of each class in one decision tree node?

2015-08-31 Thread Gilles Louppe
Also, have a look at the documentation here https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L3205 to understand the structure of the tree_ object. On 31 August 2015 at 08:55, Gilles Louppe <g.lou...@gmail.com> wrote: > Here is a sample code on how to

Re: [Scikit-learn-general] Is there any attribute saying the number of samples of each class in one decision tree node?

2015-08-30 Thread Gilles Louppe
Hi, The simplest method to get what you are looking for is to re-propagate the training samples into the tree and keep track of the nodes they traverse. You should have a look at the implementation of `apply` to get started. Hope this helps, Gilles On 30 August 2015 at 21:55, Rex X dnsr...@gmail.com

Re: [Scikit-learn-general] Is there any attribute saying the number of samples of each class in one decision tree node?

2015-08-30 Thread Gilles Louppe
(Also, this can be done in Python code, by using the interface we provide for the tree_ object) On 30 August 2015 at 22:22, Gilles Louppe g.lou...@gmail.com wrote: Hi, The simplest method to get what you are looking for is to re-propagate the training samples into the tree and keep track

Re: [Scikit-learn-general] Number of subsamples in Random Forest

2015-07-09 Thread Gilles Louppe
Hi Sebastian, Indeed, N samples are drawn with replacement, where N=len(original training set). I guess we could add an extra max_samples parameter, just like we have for the Bagging estimators. Gilles On 6 July 2015 at 23:00, Sebastian Raschka se.rasc...@gmail.com wrote: Thanks, Jeff, that

Re: [Scikit-learn-general] Decision tree regression -- mean squared error or variance reduction

2015-07-09 Thread Gilles Louppe
Hi Sebastian, The two terminologies are in fact strictly equivalent for regression. See e.g. page 46 of http://arxiv.org/abs/1407.7502 Best, Gilles On 9 July 2015 at 18:56, Sebastian Raschka se.rasc...@gmail.com wrote: Hi, all, sorry, but I have another question regarding the terminology in the

Re: [Scikit-learn-general] Finding a corresponding leaf node for each data point in a decision tree

2015-05-24 Thread Gilles Louppe
Hi, Since the last version, scikit-learn provides an `apply` method for the classifier itself, hence preventing users from shooting themselves in the foot :) So basically, you can replace clf.tree_.apply(X_train) with clf.apply(X_train) and it should work. Hope this helps, Gilles On 23 May

Re: [Scikit-learn-general] Use of the 'learn' font in third party packages

2015-04-28 Thread Gilles Louppe
Hi Trevor, I am only speaking for myself, not on behalf of the scikit-learn project, but I would be +1 for your project and use of the -learn suffix. The pros you cite are in my opinion more important than the cons. Cheers, Gilles On 28 April 2015 at 05:33, Trevor Stephens

Re: [Scikit-learn-general] random forest importance and correlated variables.

2015-04-19 Thread Gilles Louppe
Hi Luca, If you want to find all relevant features, I would recommend using ExtraTreesClassifier with max_features=1 and limited depth in order to avoid this kind of bias due to estimation errors. E.g., try with max_depth=3 to 5 or using max_leaf_nodes. Hope this helps, Gilles On 19 April
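A minimal sketch of the suggested settings; the dataset and exact parameter values are illustrative, not from the original thread:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Toy data: a few informative features among mostly noisy ones.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

# max_features=1 with limited depth, as suggested above, to reduce
# masking of relevant features by estimation errors.
clf = ExtraTreesClassifier(n_estimators=100, max_features=1, max_depth=4,
                           random_state=0).fit(X, y)
importances = clf.feature_importances_  # normalized, sums to 1
```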

Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-04-17 Thread Gilles Louppe
Hi, In general, I agree that we should at least add a way to compute feature importances using permutations. This is an alternative, yet standard, way to do it in comparison to what we do (mean decrease of impurity, which is also standard). Assuming we provide permutation importances as a

Re: [Scikit-learn-general] [ANN] scikit-learn 0.16.0 is out!

2015-03-27 Thread Gilles Louppe
Congratulations to everyone involved! Kudos to Andy, Olivier and Joel for their continuous work these last months :) On 27 March 2015 at 19:01, Alexandre Gramfort alexandre.gramf...@telecom-paristech.fr wrote: :beers: ! A

Re: [Scikit-learn-general] My personal suggestion regarding topics for GSoC

2015-03-06 Thread Gilles Louppe
Hi Luca, On 6 March 2015 at 11:09, Luca Puggini lucapug...@gmail.com wrote: Hi, It seems to me that you are discussing topics that can be introduced in sklearn with GSoC. I use sklearn quite a lot and there are a couple of things that I really miss in this library: 1- Nipals PCA. The

Re: [Scikit-learn-general] My personal suggestion regarding topics for GSoC

2015-03-06 Thread Gilles Louppe
Yes, in fact I did something similar in my thesis. See section 7.2 for a discussion about this. Figure 7.5 is similar to what you describe in your sample code. By varying the depth, you can basically control the bias. http://orbi.ulg.ac.be/bitstream/2268/170309/1/thesis.pdf On 6 March 2015 at

Re: [Scikit-learn-general] Score function in Extra-Trees

2015-02-25 Thread Gilles Louppe
Hi Pierre, While the name is different, the MSE criterion is strictly equivalent to the reduction of variance. The only difference is that we do not divide by var{y|S} because this factor is the same for all splits and all features, hence the maximizer is the same. Cheers, Gilles On 24
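The point can be checked numerically: dividing every candidate split's score by the node variance (a constant for a given node) does not change which split maximizes it. A small illustrative sketch:

```python
import numpy as np

rng = np.random.RandomState(0)
y = np.sort(rng.rand(20))  # outputs at a node, ordered along some feature

def variance_reduction(i, normalize):
    # Impurity decrease when splitting the node into y[:i] and y[i:].
    n = len(y)
    dec = np.var(y) - (i / n) * np.var(y[:i]) - ((n - i) / n) * np.var(y[i:])
    # Dividing by var(y) rescales all splits by the same constant factor.
    return dec / np.var(y) if normalize else dec

splits = range(1, len(y))
best_mse = max(splits, key=lambda i: variance_reduction(i, normalize=False))
best_norm = max(splits, key=lambda i: variance_reduction(i, normalize=True))
# best_mse == best_norm: the maximizer is the same either way.
```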

Re: [Scikit-learn-general] Problem with GIL when compiling _tree.pyx from sklearn github

2015-02-22 Thread Gilles Louppe
Thanks for the report. I can indeed reproduce the issue -- _tree.pyx no longer compiles with Cython 0.22. (The current _tree.c code was compiled with an older version of Cython.) On 23 February 2015 at 07:02, Zay Maung Maung Aye zmm...@gmail.com wrote: Hi Everyone, I downloaded the

Re: [Scikit-learn-general] Feature selection and cross validation; and identifying chosen features

2015-02-11 Thread Gilles Louppe
On 11 February 2015 at 22:22, Timothy Vivian-Griffiths vivian-griffith...@cardiff.ac.uk wrote: Hi Gilles, Thank you so much for clearing this up for me. So, am I right in thinking that the feature selection is carried for every CV-fold, and then once the best parameters have been found, the

Re: [Scikit-learn-general] Feature selection and cross validation

2015-02-10 Thread Gilles Louppe
Hi Tim, On 9 February 2015 at 19:54, Timothy Vivian-Griffiths vivian-griffith...@cardiff.ac.uk wrote: Just a quick follow up to some of the previous problems that I have had: after getting some kind assistance at the PyData London meetup last week, I found out why I was getting different

Re: [Scikit-learn-general] Samples per estimator on Random Forests

2014-12-16 Thread Gilles Louppe
Hi Miquel, These options are not available within RandomForestClassifier/Regressor. By default, len(X) samples are drawn with replacement. However, you can achieve what you are looking for using BaggingClassifier(base_estimator=DecisionTreeClassifier(...), max_samples=..., max_features=...), where max_samples
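A minimal sketch of the BaggingClassifier workaround described above (dataset and parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagged trees with explicit control over the per-tree sample size,
# which RandomForestClassifier itself does not expose here.
clf = BaggingClassifier(DecisionTreeClassifier(max_features="sqrt"),
                        n_estimators=50, max_samples=0.5,
                        random_state=0).fit(X, y)
```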

Re: [Scikit-learn-general] Access data arriving at leaf nodes

2014-10-15 Thread Gilles Louppe
Hi, I confirm what has been said before. Samples are not stored anywhere in the leaves -- only the final prediction along with some statistics. To do what you want, you have to recompute the distribution yourself, e.g. using apply and then grouping by leaf ids. Gilles On 15 October 2014 02:25,
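The "apply then group by leaf ids" recipe can be sketched as follows (dataset is illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

leaves = clf.apply(X)  # leaf id reached by each training sample

# Class distribution of the training samples falling into each leaf.
dist = {leaf: np.bincount(y[leaves == leaf], minlength=3)
        for leaf in np.unique(leaves)}
```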

Re: [Scikit-learn-general] Unpredictability of GradientBoosting

2014-09-16 Thread Gilles Louppe
Hi Deb, In your case, randomness comes from the max_features=6 setting, which makes the model not very stable from one execution to another, since the original dataset includes about 5x more input variables. Gilles On 16 September 2014 12:40, Debanjan Bhattacharyya b.deban...@gmail.com wrote:

Re: [Scikit-learn-general] algorithm used to train the tree with option 'best'

2014-09-12 Thread Gilles Louppe
Hi Luca, The best strategy consists in finding the best threshold, that is, the one that maximizes impurity decrease, when trying to partition a node into left and right nodes. By contrast, random does not look for the best split and simply draws the discretization threshold at random. For

Re: [Scikit-learn-general] algorithm used to train the tree with option 'best'

2014-09-12 Thread Gilles Louppe
Yes, exactly. Le 12 sept. 2014 18:31, Luca Puggini lucapug...@gmail.com a écrit : Hey thanks a lot, so basically in random Forest the split is done like in the algorithm described in your thesis except that the search is not done on all the variables but only on a random subset of them?

Re: [Scikit-learn-general] outlier measure random forest

2014-09-08 Thread Gilles Louppe
Hi Luca, This may not be the fastest implementation, but random forest proximities can be computed quite straightforwardly in Python given our 'apply' function. See for instance https://github.com/glouppe/phd-thesis/blob/master/scripts/ch4_proximity.py#L12 From a personal point of view, I never
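A compact (not fast) sketch of Breiman-style proximities built on `apply`, in the spirit of the linked script; the dataset and forest size are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

leaves = forest.apply(X)  # shape (n_samples, n_estimators)

# Proximity of two samples = fraction of trees in which they fall
# into the same leaf. O(n_samples^2) memory, so toy-sized data only.
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```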

Re: [Scikit-learn-general] outlier measure random forest

2014-09-08 Thread Gilles Louppe
I am rather -1 on making this a transform. There are many ways to come up with proximity measures in forests -- In fact, I don't think Breiman's is particularly well designed. On 8 September 2014 16:52, Gael Varoquaux gael.varoqu...@normalesup.org wrote: On Mon, Sep 08, 2014 at 11:49:26PM +0900,

Re: [Scikit-learn-general] outlier measure random forest

2014-09-08 Thread Gilles Louppe
of the two samples. On 8 September 2014 17:03, Mathieu Blondel math...@mblondel.org wrote: On Mon, Sep 8, 2014 at 11:55 PM, Gilles Louppe g.lou...@gmail.com wrote: I am rather -1 on making this a transform. There are many ways to come up with proximity measures in forests -- In fact, I don't

Re: [Scikit-learn-general] Questions on random forests

2014-07-28 Thread Gilles Louppe
Hi Kevin, Interesting question. Your point is true provided you have an infinite amount of training data. In that case, you can indeed show that an infinitely large forest of extremely randomized trees built for K=1 converges towards an optimal model (the Bayes model). This result however does

Re: [Scikit-learn-general] Better precision for class probabilities (predict_proba).

2014-06-03 Thread Gilles Louppe
Hi Pranav, You should increase the number of trees. By default, it is set to 10, which would explain why you don't reach higher precision. Best, Gilles On 3 June 2014 07:32, Pranav O. Sharma emailpra...@gmail.com wrote: Hi, I'm trying to use
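The granularity issue can be seen directly: with fully grown trees, each tree votes 0 or 1 per class, so the forest's probabilities are averages over n_estimators trees. A small sketch (dataset is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)

# With 10 fully grown trees, every predicted probability is a
# multiple of 1/10; more trees give a finer-grained grid.
proba = RandomForestClassifier(n_estimators=10,
                               random_state=0).fit(X, y).predict_proba(X)
```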

Re: [Scikit-learn-general] Better precision for class probabilities (predict_proba).

2014-06-03 Thread Gilles Louppe
] [ 0.4 0.6 ] [ 0.41 0.59] [ 0.65 0.35] [ 0.52 0.48] [ 0.42 0.58] [ 0.49 0.51] [ 0.19 0.81] [ 0.71 0.29] [ 0.24 0.76]] On Mon, Jun 2, 2014 at 11:08 PM, Gilles Louppe g.lou...@gmail.com wrote: Hi Pranav, You should increase the number of trees. By default, it is set

Re: [Scikit-learn-general] Unexpected behavior using numpy.asarray with RandomForestClassifier

2014-05-26 Thread Gilles Louppe
Why do you want to put a random forest in a numpy array in the first place? Best, Gilles On 26 May 2014 13:11, Lars Buitinck larsm...@gmail.com wrote: 2014-05-24 0:28 GMT+02:00 Steven Kearnes skear...@gmail.com: a is a list of the individual DecisionTreeClassifier objects belonging to the

Re: [Scikit-learn-general] My talk was approved for EuroScipy'14

2014-05-23 Thread Gilles Louppe
Hi Lars, Thanks! Oh, I would be interested in seeing them. Could you send me the link if you still have them? Thanks, Gilles On 23 May 2014 11:05, Lars Buitinck larsm...@gmail.com wrote: 2014-05-22 8:13 GMT+02:00 Gilles Louppe g.lou...@gmail.com: Just for letting you know, my talk Accelerating

Re: [Scikit-learn-general] My talk was approved for EuroScipy'14

2014-05-23 Thread Gilles Louppe
Thanks! This is really cool! I think I'll try to reproduce some of them and put one or two in my slides. On 23 May 2014 11:29, Lars Buitinck larsm...@gmail.com wrote: 2014-05-23 11:08 GMT+02:00 Gilles Louppe g.lou...@gmail.com: Thanks! Oh, I would be interested in seeing them. Could send me

Re: [Scikit-learn-general] Manual categories/separate classifiers

2014-05-23 Thread Gilles Louppe
Hi Tim, In principle, what you describe exactly corresponds to the decision tree algorithm. You partition the input space into smaller subspaces, on which you recursively build sub-decision trees. In practice however, I would not split things by hand, unless you are interested in discovering

[Scikit-learn-general] My talk was approved for EuroScipy'14

2014-05-22 Thread Gilles Louppe
Hi folks, Just to let you know, my talk "Accelerating Random Forests in Scikit-Learn" was approved for EuroScipy'14. Details can be found at https://www.euroscipy.org/2014/schedule/presentation/9/. My slides are far from being ready, but my intention is to present our team efforts on the tree

Re: [Scikit-learn-general] Random forest uses only one core with n_jobs = -1

2014-04-18 Thread Gilles Louppe
Hi, Can you try on 0.15-dev to see if it solves your issues? We have changed the backend for parallelizing trees. Gilles On 18 April 2014 23:13, Zygmunt Zając zajac.zygm...@gmail.com wrote: Hi, When I train a random forest, I'd like it to use all the cores. I set n_jobs = -1, but it doesn't

Re: [Scikit-learn-general] normalising/scaling input for SVM or Random Forests

2014-03-15 Thread Gilles Louppe
Hi Satra, In case of Extra-Trees, changing the scale of features might change the result when the transform you apply distorts the original feature space. Drawing a threshold uniformly at random in the original [min;max] interval won't be equivalent to drawing a threshold in [f(min);f(max)] if f

Re: [Scikit-learn-general] GSoC

2014-03-12 Thread Gilles Louppe
On 12 March 2014 13:08, Felipe Eltermann felipe.elterm...@gmail.com wrote: Hello Vamsi, Firstly, regarding the implementation of sparse functions. _tree.pxy is the back end cython code to handle the operations Splitting, Evaluating impurities at nodes and then constructing the tree. That's

Re: [Scikit-learn-general] Negative feature_importances in random forest with sample_weights

2014-02-06 Thread Gilles Louppe
Dear Vincent, On 6 February 2014 17:46, Vincent Arel vincent.a...@gmail.com wrote: Hi all, Gilles Louppe[1] suggests that feature importance in random forest classifiers is calculated using the algorithm of Breiman (1984). I imagine this is the same as formula 10.42 on page 368 of Hastie et

Re: [Scikit-learn-general] Negative feature_importances in random forest with sample_weights

2014-02-06 Thread Gilles Louppe
Vincent, I identified the bug and opened an issue at https://github.com/scikit-learn/scikit-learn/issues/2835 I will try to fix this in the next few days. Sorry for the inconvenience. Gilles On 6 February 2014 18:18, Gilles Louppe g.lou...@gmail.com wrote: Dear Vincent, On 6 February 2014 17

Re: [Scikit-learn-general] Combine criterions for building a tree

2014-01-29 Thread Gilles Louppe
Hi Pablo, I am not sure re-implementing a new criterion is what you are looking for. Criteria are made to evaluate the goodness of a split (i.e., a binary partition of the samples in the current node) in terms of impurity with regards to the output variable - not the inputs. What you should do

Re: [Scikit-learn-general] Combine criterions for building a tree

2014-01-29 Thread Gilles Louppe
originally designed to handle categorical variables properly...) Cheers, Pablo On 29 January 2014 20:30, Gilles Louppe g.lou...@gmail.com wrote: Hi Pablo, I am not sure re-implementing a new criterion is what you are looking for. Criteria are made to evaluate the goodness of a split (i.e

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-28 Thread Gilles Louppe
Given our intent to release 1.0 in the near future, I think we should also make it clear on the wiki page that adding more and more algorithms is not exactly the direction in which we are going. Maybe this is the opportunity to remove some of the old subjects from 2013 and instead add topics

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-23 Thread Gilles Louppe
How much code in our current implementation depends on the data representation? Not much actually. It now basically boils down to simply writing a new splitter object. Everything else remains the same. So basically, I would say that it amounts to ~300 lines of Cython (out of the 2300 lines in our

Re: [Scikit-learn-general] Sparse matrix support for Decision tree implementation

2014-01-22 Thread Gilles Louppe
Mathieu, I have no experience with forests on sparse data, nor have I seen much work on the topic. I would be curious to investigate however; there may be problems for which this is useful. I know that Arnaud tried forests on (densified) 20newsgroups and it seems to work well actually. In

Re: [Scikit-learn-general] A poster about scikit-learn at Giga-day

2014-01-17 Thread Gilles Louppe
By the way, if any of you would like to recycle this poster, sources are available at https://github.com/glouppe/talk-sklearn-mloss-nips2013/tree/master/poster On 16 January 2014 16:41, Arnaud Joly a.j...@ulg.ac.be wrote: Hi everyone, There is a local event at my university which is called

Re: [Scikit-learn-general] Custom splitting criterion for decision tree classifier

2014-01-12 Thread Gilles Louppe
Dear Caleb, The current implementation does not allow for that. You can do as suggested by Lars though, if this is practical for you. Gilles On 12 January 2014 16:03, Caleb cloverev...@yahoo.com wrote: Hi all, In the current implementation of the decision tree, data is split according

Re: [Scikit-learn-general] Loading trained classifier

2014-01-07 Thread Gilles Louppe
Dear Adolfo, You could instead use pickle, which will create a single file. Best, Gilles On 7 January 2014 16:49, Adolfo Martinez amarti...@intelimetrica.comwrote: Hello, I have a trained ExtraTreesRegressor saved using joblib.dump (without compress). This creates more than ten thousand
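A minimal sketch of the pickle suggestion; the model and file location are illustrative:

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesRegressor

X, y = load_iris(return_X_y=True)
reg = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(X, y)

# pickle serializes the whole fitted model into a single file.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(reg, f)
with open(path, "rb") as f:
    restored = pickle.load(f)
```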

Re: [Scikit-learn-general] Custom function in decision-tree based classifiers

2013-11-07 Thread Gilles Louppe
Hi Thomas, Indeed, gini and entropy are the only supported impurity criteria for classification. I don't think we have plans right now to add others - which one do you have in mind? how feasible would it be to have the option of passing a custom function to the tree or forest to use in splitting?

Re: [Scikit-learn-general] release time

2013-11-06 Thread Gilles Louppe
Hi, Thanks for pointing this out, Andy! I think it would indeed help to set some coarse deadline for the next release. This would help us gain momentum and get things done. End of December or beginning of January would be best for me. On my side, I don't plan to contribute anything big in the meantime.

Re: [Scikit-learn-general] SVM classifier.

2013-10-18 Thread Gilles Louppe
Hi Nigel, What is the proportion of English versus non-English tweets in your data? It may be the case that your dataset is unbalanced. Gilles On 18 October 2013 09:32, Nigel Legg nigel.l...@gmail.com wrote: I have a set of tweets, and I am trying to use an SVM classifier to class them as

Re: [Scikit-learn-general] Fwd: [Broken] scikit-learn/scikit-learn#4168 (knnd - 6ec6346)

2013-10-16 Thread Gilles Louppe
The branch is now deleted ;) On 17 October 2013 06:35, Robert Layton robertlay...@gmail.com wrote: I know, I'm very sorry. I made a new branch directly from upstream/master, then pushed without checking where that push was going to. Can someone please delete this branch for me? I dare not

Re: [Scikit-learn-general] recommendation systems

2013-10-07 Thread Gilles Louppe
Hi Robert, Unfortunately, algorithms for recommender systems are not planned in scikit-learn in the short or mid-term. I would advise you to look at other libraries that are specifically targeting that problem. In particular, GraphLab (http://graphlab.org/) is among the best libraries for

Re: [Scikit-learn-general] Which scikit-learn contributors share common interests?

2013-09-25 Thread Gilles Louppe
on that file. Jake On Wed, Sep 25, 2013 at 5:19 AM, Gilles Louppe g.lou...@gmail.com wrote: Hi, I have just put together a quick and dirty script that does that. It extracts the number of commits for all developers, for all files on a git directory. It then computes the 3 nearest

Re: [Scikit-learn-general] Which scikit-learn contributors share common interests?

2013-09-25 Thread Gilles Louppe
On 25 September 2013 19:05, Andreas Mueller amuel...@ais.uni-bonn.de wrote: On 09/25/2013 06:44 PM, Olivier Grisel wrote: 2013/9/25 Andreas Mueller amuel...@ais.uni-bonn.de: On 09/25/2013 04:15 PM, Jacob Vanderplas wrote: Very cool! One quick comment: I'd probably normalize the values in the

Re: [Scikit-learn-general] NIPS's Machine Learning Open Source Software workshop

2013-09-16 Thread Gilles Louppe
...@gmail.com wrote: Congrats to both of you - enjoy the skiing (boy, I'm jealous)! 2013/9/5 Gilles Louppe g.lou...@gmail.com Congratulations Gael! Ours is also officially accepted, so you can count on me. Gilles On Thursday, 5 September 2013, Gael Varoquaux gael.varoqu...@normalesup.org

Re: [Scikit-learn-general] Categorical values and decision tree classifier

2013-09-12 Thread Gilles Louppe
Dear Yegle, 1) What does your data represent? Are your features numbers or concepts? In the first case, you should try to build your estimator without encoding anything. In the second case, it might also not be necessary to one-hot encode your categorical features. Try with and without encoding

Re: [Scikit-learn-general] NIPS's Machine Learning Open Source Software workshop

2013-08-28 Thread Gilles Louppe
for further promotion of the project. Gilles On 22 August 2013 16:26, Gilles Louppe g.lou...@gmail.com wrote: Hi, It is more than likely that I will be there this year - given the reviews of our paper, I would be surprised if it was rejected. What sort of talk would you have in mind Nelle? Gilles

Re: [Scikit-learn-general] Can Random Forest Classifer ignore specific fields?

2013-08-13 Thread Gilles Louppe
Hi, As Roland says, this is a Numpy question rather than a scikit-learn question. If you want to ignore specific fields then it indeed amounts to removing the corresponding columns in your X array before feeding it to your estimator. (Note however that Random Forests have the advantages of being

Re: [Scikit-learn-general] Mac OS install problem

2013-08-13 Thread Gilles Louppe
Hi, Please be more specific. What are the error messages? Best, Gilles On 13 August 2013 14:14, MORGANDON G doh...@mac.com wrote: Can someone direct me to the correct place to find help with an installation problem I have on the Mac? I used MacPorts and it said everything went just

Re: [Scikit-learn-general] Mac OS install problem

2013-08-13 Thread Gilles Louppe
wrote: command not found On Aug 13, 2013, at 8:21 PM, Gilles Louppe g.lou...@gmail.com wrote: Hi, Please be more specific. What are the error messages? Best, Gilles On 13 August 2013 14:14, MORGANDON G doh...@mac.com wrote: Can someone direct me to the correct place to find help

Re: [Scikit-learn-general] Mac OS install problem

2013-08-13 Thread Gilles Louppe
' is this good news? Don On Aug 13, 2013, at 8:38 PM, Gilles Louppe g.lou...@gmail.com wrote: What command are you typing? To use scikit-learn, you have to either use a Python shell (i.e., using the python command in a terminal) or execute a Python script (using python script.py). Are you familiar

Re: [Scikit-learn-general] Associating a LabelEncoder with a Classifier?

2013-07-18 Thread Gilles Louppe
Hi, I'm well aware I can pickle it, but I would like to avoid having to write 2 files - otherwise I would just write the classes to a text file. You can pickle several Python objects using the same file handler. Gilles Lars, Well, I'm confused now, sklearn.__version__ says 0.14-git. Did I
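Successive dumps on one file handle are read back with successive loads, in the same order. A sketch with a LabelEncoder and a classifier (the objects and labels are illustrative):

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
encoder = LabelEncoder().fit(["setosa", "versicolor", "virginica"])
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
# Two dumps into the same file...
with open(path, "wb") as f:
    pickle.dump(encoder, f)
    pickle.dump(clf, f)
# ...then two loads, in the same order.
with open(path, "rb") as f:
    encoder2 = pickle.load(f)
    clf2 = pickle.load(f)
```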

Re: [Scikit-learn-general] Paris Sprint location

2013-07-12 Thread Gilles Louppe
- discuss with the tree growers guys how to best parallelize random forest training on multi-core without copying the training set in memory - either with threads in joblib and with nogil statements in the inner loops of the (new) cython code - either with shared memory and the

Re: [Scikit-learn-general] conditional inference trees

2013-07-07 Thread Gilles Louppe
Hi Theofilos, That would be great! I think it could easily be done by adding new Criterion classes into the _tree.pyx file. Note however that we are currently refactoring the core tree module. It may be best to wait for it to be merged before you start coding - otherwise you may end up with lots of

Re: [Scikit-learn-general] Bootstrap aggregating

2013-06-21 Thread Gilles Louppe
Hi, Such ensembles are not implemented at the moment. Gilles On 21 June 2013 09:59, Maheshakya Wijewardena pmaheshak...@gmail.com wrote: Hi all, I would like to know whether we have bootstrap aggregating functionality in scikit-learn library. If so, How do I use that? (If it doesn't exist

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Gilles Louppe
Hi, This looks like the dataset from the Amazon challenge currently running on Kaggle. When one-hot-encoded, you end up with roughly 15,000 binary features, which means that the dense representation requires at least 32000*15000*4 bytes to hold in memory (or even twice as much depending on
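The memory blow-up is avoided by keeping the one-hot encoding sparse: only the nonzero entries (one per categorical column per row) are stored. A sketch with made-up stand-in data of similar shape:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
# Stand-in for high-cardinality categorical columns (values are random).
X = rng.randint(0, 5000, size=(32000, 3))

enc = OneHotEncoder()  # sparse output by default
X_encoded = enc.fit_transform(X)
# Each row stores just 3 nonzeros instead of ~15,000 dense columns.
```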

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Gilles Louppe
On 3 June 2013 08:43, Andreas Mueller amuel...@ais.uni-bonn.de wrote: On 06/03/2013 05:19 AM, Joel Nothman wrote: However, in these last two cases, the number of possible splits at a single node is linear in the number of categories. Selecting an arbitrary partition allows exponentially many

Re: [Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread Gilles Louppe
Hi, The main question is, what is your definition of an important variable? Gilles On 1 June 2013 14:22, o m oda...@gmail.com wrote: I've been playing around with Lasso and Lars, but there's something that bothers me about standardization. If I don't standardize to N(0, 1), these procedures

Re: [Scikit-learn-general] greetings; more flexibility in trees

2013-05-23 Thread Gilles Louppe
Hi Ken, I share and understand your concerns about the rigidity of the current implementation. I like using Extremely Randomized Trees, but I'm looking for more flexibility in generating them. In particular, I'd like to be able to specify my own criterion and split finding algorithm. I'm

Re: [Scikit-learn-general] Build infrastructure and pep8

2013-05-01 Thread Gilles Louppe
Thanks for solving the Travis bug :) On 1 May 2013 21:15, Gael Varoquaux gael.varoqu...@normalesup.org wrote: On Wed, May 01, 2013 at 06:19:34PM +0200, Olivier Grisel wrote: I spend a couple of hours fixing the build infrastructures: Wow! Thank you so much. These are well-spent hours. G

Re: [Scikit-learn-general] Distributed RandomForests

2013-04-25 Thread Gilles Louppe
Hi Youssef, Regarding memory usage, you should know that it'll basically blow up if you increase the number of jobs. With the current implementation, you'll need O(n_jobs * |X| * 2) in memory space (where |X| is the size of X, in bytes). That issue stems from the use of joblib which basically

Re: [Scikit-learn-general] Our own Olivier Grisel giving a scipy keynote

2013-04-17 Thread Gilles Louppe
Congratulations are in order :-) On 17 April 2013 08:06, Peter Prettenhofer peter.prettenho...@gmail.comwrote: That's great - congratulations Olivier! Definitely, no pressure ;-) 2013/4/17 Ronnie Ghose ronnie.gh...@gmail.com wow :O congrats On Tue, Apr 16, 2013 at 7:17 PM, Mathieu

Re: [Scikit-learn-general] SO question for the tree growers

2013-04-04 Thread Gilles Louppe
Hi Olivier, There are indeed several ways to get feature importances. As often, there is no strict consensus about what this word means. In our case, we implement the importance as described in [1] (often cited, but unfortunately rarely read...). It is sometimes called gini importance or mean
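The importance measure described above is exposed as the `feature_importances_` attribute. A minimal sketch (using the modern sklearn API, which postdates this thread):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

print(forest.feature_importances_)        # one value per feature
print(forest.feature_importances_.sum())  # normalized to sum to 1
```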

Re: [Scikit-learn-general] domain of appicability - RandomForest, predict_proba function

2013-03-20 Thread Gilles Louppe
Note that you can get perfect scores (either 0.0 or 1.0) simply by setting n_estimators=1. This is why you should use this measure with caution. On 20 March 2013 15:27, Lars Buitinck l.j.buiti...@uva.nl wrote: 2013/3/20 paul.czodrow...@merckgroup.com I was just about to say that discarding
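The degenerate-probability effect is easy to reproduce: a single fully-grown tree has pure leaves, so every predicted probability is exactly 0 or 1. A quick sketch (the `bootstrap=False` setting is my addition, so the tree sees all training samples):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=1, bootstrap=False,
                             random_state=0).fit(X, y)
proba = clf.predict_proba(X)
print(np.unique(proba))   # only degenerate probabilities
```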

Re: [Scikit-learn-general] Finding dimentions of faces on an image

2013-03-19 Thread Gilles Louppe
Hi, Short answer: you can't. Longer answer: If you use as training samples the whole images (with faces somewhere in there), then your model is learning to discriminate between your 2 categories, from the whole images, with **no** information about where the faces are actually located. As such,

Re: [Scikit-learn-general] Flat is better than nested: Website edition

2013-03-05 Thread Gilles Louppe
I feel like the About us section on the homepage shouldn't be there. I'd rather put an About link somewhere else than putting this in front on the home page. Also, I would use the space that we now have on the front page to highlight more important aspects of the package. On 5 March 2013 14:46,

Re: [Scikit-learn-general] How to get all rules in a tree by leaf node path

2013-02-27 Thread Gilles Louppe
Hi David, I think you should have a look at sklearn.tree.export_graphviz. It will generate a picture of the tree for you. - Reference: http://scikit-learn.org/dev/modules/generated/sklearn.tree.export_graphviz.html#sklearn.tree.export_graphviz - Example:
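A minimal usage sketch of the export function referenced above (the `out_file=None` form, which returns the DOT source as a string, is from later sklearn versions than this thread):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

dot = export_graphviz(tree, out_file=None)  # DOT source; render with graphviz
print(dot[:60])
```

The returned string can be rendered with the graphviz `dot` tool to get the picture of the tree, including the split rules along each root-to-leaf path.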

Re: [Scikit-learn-general] Weighted and Balanced Random Forests

2013-02-07 Thread Gilles Louppe
Hello, You might achieve what you want by using sample weights when fitting your forest (See the 'sample_weight' parameter). There is also a 'balance_weights' method from the preprocessing module that basically generates sample weights for you, such that classes become balanced.
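A sketch of the weighting approach described above. Note this uses `sklearn.utils.class_weight.compute_sample_weight`, the modern replacement for the `balance_weights` helper mentioned in the thread (which has since been removed); the toy 9:1 imbalance is made up:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.RandomState(0)
X = rng.randn(300, 4)
y = np.array([0] * 270 + [1] * 30)          # 9:1 class imbalance

# weight = n_samples / (n_classes * class_count), so rare classes count more
weights = compute_sample_weight("balanced", y)
clf = RandomForestClassifier(n_estimators=25, random_state=0)
clf.fit(X, y, sample_weight=weights)
print(weights[y == 0][0], weights[y == 1][0])
```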

Re: [Scikit-learn-general] Any interest in the Extreme Learning Machine?

2013-02-05 Thread Gilles Louppe
Hi David, What is an SLFN? Do you have any pointer to a reference paper? Best, Gilles On 5 February 2013 00:51, David Lambert caliband...@gmail.com wrote: Hi, I'm new to the list so please forgive my trespasses... I've nearly completed an implementation of the Extreme Learning Machine

Re: [Scikit-learn-general] User Survey

2013-02-04 Thread Gilles Louppe
Hi Andy, Do we really need to take it now? It still gets new answers everyday (28 new answers since your first post 3 days ago). It doesn't hurt to let it online for a week or so, does it? Just my 2 cents. -- Everyone

Re: [Scikit-learn-general] LOF implementation

2013-01-30 Thread Gilles Louppe
I don't know about Lube and Oil, but we have some Filters in the feature_selection package. HTH, G On 30 January 2013 16:04, Andreas Mueller amuel...@ais.uni-bonn.de wrote: On 01/30/2013 03:59 PM, Brian Holt wrote: Is it any one of these? It might be Local Outlier Factor, as we already

Re: [Scikit-learn-general] ANN: scikit-learn 0.13 released!

2013-01-21 Thread Gilles Louppe
Great job to all of you :) Gilles On 22 January 2013 07:57, Peter Prettenhofer peter.prettenho...@gmail.com wrote: Great work guys - especially Andy - thanks a lot for making this happen! best, Peter 2013/1/22 Gael Varoquaux gael.varoqu...@normalesup.org: On Tue, Jan 22, 2013 at

Re: [Scikit-learn-general] Fwd: Pool Seems Closed Error

2013-01-16 Thread Gilles Louppe
Just to let you know, it is basically useless to grid-search over the n_estimators parameter in your forests. The higher, the better. However, you might try to tune min_samples_split (from 1 to n_features). It is one of the few parameters that will actually lead to any improvement in terms of
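A sketch of the suggested search, dropping n_estimators from the grid and tuning min_samples_split instead (modern sklearn requires min_samples_split >= 2, and the candidate values here are my own):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    {"min_samples_split": [2, 4, 8, 16]},   # tune this, not n_estimators
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```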

Re: [Scikit-learn-general] Sklearn joblib issue with custom analyzer

2013-01-15 Thread Gilles Louppe
Hi, Can you give us the full script that is used to load your model? In that script, have you imported my_analyzer? Best, Gilles On 15 January 2013 13:51, JAGANADH G jagana...@gmail.com wrote: Hi All, I was trying to save and load a model (Text Classificaion with SVM) using joblib. In the

Re: [Scikit-learn-general] set sample weights in Pipeline?

2013-01-10 Thread Gilles Louppe
... or more simply: pipeline.fit(X, y, nb__sample_weight=sample_weight) On 10 January 2013 15:20, Gilles Louppe g.lou...@gmail.com wrote: Hi, I don't know how it interfaces with NLTK's SklearnClassifier, but if you can work your way using only Scikit-Learn for training, then can you pass
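A self-contained sketch of that one-liner: Pipeline.fit routes keyword arguments of the form `step__param` to the named step's fit method. The step name "nb" and the toy data are illustrative (the thread's actual pipeline is not shown):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

X = np.abs(np.random.RandomState(0).randn(20, 3))   # MultinomialNB needs X >= 0
y = np.array([0, 1] * 10)
sample_weight = np.ones(20)

pipe = Pipeline([("nb", MultinomialNB())])
pipe.fit(X, y, nb__sample_weight=sample_weight)      # routed to the "nb" step
print(pipe.score(X, y))
```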

Re: [Scikit-learn-general] Size of random forest model

2013-01-08 Thread Gilles Louppe
Hi David, On 9 January 2013 02:14, David Broyles sj.clim...@gmail.com wrote: Hi, I'm pretty new to scikit-learn. I've generated a random forest (classification) of 100 trees using default attributes. My data set has over 2M examples. 2 questions: 1) I've noticed the size of the pickled

Re: [Scikit-learn-general] Generalized Cross-Validation API

2012-12-25 Thread Gilles Louppe
Hi Andreas! ... and Merry Christmas to all! Quick and naive question: what is the point in cross-validating the number of trees in RandomForest (or in Extra-Trees)? The rule is simple: the more, the better. Gilles On 25 December 2012 13:07, Andreas Mueller amuel...@ais.uni-bonn.de
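The "more is better" rule is easy to check empirically: forest test accuracy tends to improve (and then plateau) with the number of trees, so cross-validating n_estimators mostly wastes compute. A sketch on made-up data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

scores = {}
for n in (1, 10, 100):
    clf = RandomForestClassifier(n_estimators=n, random_state=0).fit(Xtr, ytr)
    scores[n] = clf.score(Xte, yte)
    print(n, scores[n])
```

In practice one picks as many trees as the time budget allows, rather than searching for an "optimal" count.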

Re: [Scikit-learn-general] Generalized Cross-Validation API

2012-12-25 Thread Gilles Louppe
Thanks, Gilles On Tuesday, 25 December 2012, Gilles Louppe g.lou...@gmail.com wrote: Hi Andreas! ... and Merry Christmas to all! Quick and naive question: what is the point in cross-validating the number of trees in RandomForest (or in Extra-Trees)? The rule is simple: the more

Re: [Scikit-learn-general] Shape of classes_ varies?

2012-11-29 Thread Gilles Louppe
Hi, Yes, since decision trees handle multi-output problems, classes_[i] is an array containing the classes for the i-th output. Hence classes_[0] is the array you are looking for when `y` is 1D. I guess we could transform classes_ directly into that array if the decision tree is trained on a
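The shape difference described above can be seen directly: with 1-D `y` the tree exposes a single flat `classes_` array, while a 2-D (multi-output) `y` yields one array per output column. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y1 = np.array([0, 0, 1, 1])                       # single output
y2 = np.column_stack([y1, np.array([2, 3, 2, 3])])  # two outputs

single = DecisionTreeClassifier(random_state=0).fit(X, y1)
multi = DecisionTreeClassifier(random_state=0).fit(X, y2)

print(single.classes_)   # flat array of class labels
print(multi.classes_)    # one array of labels per output column
```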

Re: [Scikit-learn-general] Shape of classes_ varies?

2012-11-29 Thread Gilles Louppe
we could have a method ``supports_multi_output`` that returns a boolean so we know what shape the classes_ are given some arbitrary clf? Or just introspect it? Doug On Thu, Nov 29, 2012 at 9:57 AM, Gilles Louppe g.lou...@gmail.com wrote: Hi, Yes, since decision trees handle multi-output

Re: [Scikit-learn-general] Shape of classes_ varies?

2012-11-29 Thread Gilles Louppe
`i` is the output index, corresponding to the i-th column of y. On 29 November 2012 22:00, Lars Buitinck l.j.buiti...@uva.nl wrote: 2012/11/29 Gilles Louppe g.lou...@gmail.com: Yes, since decision trees handle multi-output problems, classes_[i] is an array containing the classes for the i-th
