Re: [Scikit-learn-general] Backward compat policy in utils

2014-09-12 Thread Mathieu Blondel
We should survey what other packages use. I'll have a look at what lightning uses later. Mathieu On Sat, Sep 13, 2014 at 2:23 AM, Andy wrote: > +1 of cleaning up __init__.py (maybe no implementations at all?) > +1 for making private methods start with underscore (which will break > everything

Re: [Scikit-learn-general] pre-tokenized data (not splitting on white space)

2014-09-12 Thread Patrick Short
Hi all, The following solved my issue: def pre_tokenized(doc): """doc is a list of tokenized lists (with pre-tokenized values) that will be passed to sklearn to bypass the analyzer""" return doc tfidf = TfidfVectorizer(analyzer=pre_tokenized) tfidf.fit(content) Seems lik
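Patrick's snippet can be expanded into a self-contained sketch (the function name and the toy MeSH-style tokens below are my own illustration; the point confirmed by the thread is that a callable `analyzer` returning the document unchanged bypasses scikit-learn's own tokenization):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity_analyzer(doc):
    """Pass each pre-tokenized document straight through,
    skipping scikit-learn's tokenization and preprocessing."""
    return doc

# Each document is already a list of tokens (e.g. MeSH tags).
docs = [["heart", "aorta"], ["heart", "lung", "lung"]]

tfidf = TfidfVectorizer(analyzer=identity_analyzer)
X = tfidf.fit_transform(docs)  # one row per document, one column per distinct token
```

Note that with a callable analyzer, options like `lowercase` and `token_pattern` are not applied, so any normalization has to happen upstream.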

[Scikit-learn-general] pre-tokenized data (not splitting on white space)

2014-09-12 Thread Patrick Short
Hi all, I am trying to do tfidf/lsa on pre-tokenized data (MeSH tags for any biology folks out there) and am trying to skip tokenization since pre-processing has already done so. Unfortunately I am having trouble following the 'tips and tricks' in the doc: Some tips and tricks: If documents are pre

Re: [Scikit-learn-general] binarizer with more levels

2014-09-12 Thread Pagliari, Roberto
I’m now getting this: 'Quantizer' object has no attribute 'get_params' Do I need to add some other classes to the declaration? Thanks, From: Joel Nothman [mailto:[email protected]] Sent: Thursday, September 11, 2014 9:37 PM To: scikit-learn-general Subject: Re: [Scikit-learn-general] binar

Re: [Scikit-learn-general] oob_score_ for random forests for regression

2014-09-12 Thread Arnaud Joly
Here is the link to the issue https://github.com/scikit-learn/scikit-learn/issues/3455 Arnaud On 12 Sep 2014, at 20:01, Arnaud Joly wrote: > If you want to work on custom oob scoring, there is an issue opened > for it. > > Best regards, > Arnaud > > On 12 Sep 2014, at 19:01, Josh Wasserstein wr

Re: [Scikit-learn-general] oob_score_ for random forests for regression

2014-09-12 Thread Arnaud Joly
If you want to work on custom oob scoring, there is an issue opened for it. Best regards, Arnaud On 12 Sep 2014, at 19:01, Josh Wasserstein wrote: > Thanks! Couldn't find it in the documentation. I may try adding that to a PR. > > Josh > > On Fri, Sep 12, 2014 at 10:07 AM, Arnaud Joly wrote:

Re: [Scikit-learn-general] getting different results with sklearn gridsearchCV

2014-09-12 Thread Pagliari, Roberto
Thanks for the suggestions. With that fix, scaling+gridsearch is giving me the same results (w.r.t. my own gridsearch). I will try to add binning as well. Thank you again! From: Andy [mailto:[email protected]] Sent: Friday, September 12, 2014 1:18 PM To: [email protected]

Re: [Scikit-learn-general] Backward compat policy in utils

2014-09-12 Thread Andy
+1 of cleaning up __init__.py (maybe no implementations at all?) +1 for making private methods start with underscore (which will break everything ^^) Also we need to add utils to the References then. No idea how to decide what should be public and what not, though. On 09/08/2014 04:01 PM, Mat

Re: [Scikit-learn-general] binarizer with more levels

2014-09-12 Thread Andy
On 09/12/2014 06:20 PM, Pagliari, Roberto wrote: I added import sklearn.base.TransformerMixin but it says no module named TransofrmerMixin Because TransformerMixin is not a module but a class. You have to do from sklearn.base import TransformerMixin *From:*Joel Nothman [mailto:joel.noth..
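A minimal sketch of the `Quantizer` class discussed in this thread (the thresholding logic is my own stand-in, since the actual implementation is not shown); it illustrates both the correct import Andy gives and the fix for the earlier `'Quantizer' object has no attribute 'get_params'` error: inheriting from `BaseEstimator` supplies `get_params`/`set_params`, which grid search and pipelines need, while `TransformerMixin` adds `fit_transform`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Quantizer(BaseEstimator, TransformerMixin):
    """Binarize features at a fixed threshold (illustrative logic)."""

    def __init__(self, threshold=0.0):
        # BaseEstimator derives get_params() from __init__'s signature,
        # so parameters must be stored under the same attribute names.
        self.threshold = threshold

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return (np.asarray(X) > self.threshold).astype(int)

q = Quantizer(threshold=0.5)
print(q.get_params())           # {'threshold': 0.5}
print(q.transform([[0.2, 0.9]]))
```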

Re: [Scikit-learn-general] getting different results with sklearn gridsearchCV

2014-09-12 Thread Andy
As Laurent said using StandardScaler again is not necessary. If you don't provide code for your custom grid-search, it is hard to say what the difference might be ;) Are the same parameters selected and are the scores during the grid-search the same? On 09/12/2014 06:31 PM, Pagliari, Robert

Re: [Scikit-learn-general] algorithm used to train the tree with option 'best'

2014-09-12 Thread Gilles Louppe
Yes, exactly. On 12 Sep 2014 18:31, "Luca Puggini" wrote: > Hey thanks a lot, > so basically in random Forest the split is done like in the algorithm > described in your thesis except that the search is not done on all the > variables but only on a random subset of them? (usually sqrt(p) or

Re: [Scikit-learn-general] oob_score_ for random forests for regression

2014-09-12 Thread Josh Wasserstein
Thanks! Couldn't find it in the documentation. I may try adding that to a PR. Josh On Fri, Sep 12, 2014 at 10:07 AM, Arnaud Joly wrote: > Hi, > > The r2_score metric is used. > > Best regards, > Arnaud > > On 12 Sep 2014, at 16:04, Josh Wasserstein wrote: > > What error metric is used for this

Re: [Scikit-learn-general] getting different results with sklearn gridsearchCV

2014-09-12 Thread Laurent Direr
Hi Roberto, You do not need to scale here (you can remove the first 3 lines); the point of the pipeline is precisely that you do not have to do this: After this I make the predictions scaler = StandardScaler() X_train = scaler.fit_transform(X_train) X_test = scaler.transform(X_test) y_pr
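A sketch of the setup Laurent describes, on synthetic data of my own choosing; module paths follow current scikit-learn (at the time of this thread, `train_test_split` lived in `sklearn.cross_validation` rather than `sklearn.model_selection`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline handles scaling internally: no manual
# fit_transform/transform calls around the classifier.
pipe = Pipeline([("scaler", StandardScaler()), ("svm", LinearSVC())])
pipe.fit(X_train, y_train)     # scaler statistics come from the training data only
y_pred = pipe.predict(X_test)  # X_test is scaled with those same statistics
```

Scaling the data again before calling `pipe.fit` would be redundant (and, done outside cross-validation, can leak test-fold statistics into training).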

Re: [Scikit-learn-general] getting different results with sklearn gridsearchCV

2014-09-12 Thread Pagliari, Roberto
Hi Andy, I don't think the accuracy is an issue. I explicitly provided a score function and the problem persists. With my own gridsearch I don't use pipeline, just stratifiedKFold and average for every combination of the parameters. This is an example with scaling+svm using sklearn pipeline:

[Scikit-learn-general] algorithm used to train the tree with option 'best'

2014-09-12 Thread Luca Puggini
Hey thanks a lot, so basically in random Forest the split is done like in the algorithm described in your thesis except that the search is not done on all the variables but only on a random subset of them? (usually sqrt(p) or something like that) Let me know. Thanks, Luca Hi Luca, > > The "best"

Re: [Scikit-learn-general] binarizer with more levels

2014-09-12 Thread Pagliari, Roberto
I added import sklearn.base.TransformerMixin but it says no module named TransofrmerMixin From: Joel Nothman [mailto:[email protected]] Sent: Thursday, September 11, 2014 9:37 PM To: scikit-learn-general Subject: Re: [Scikit-learn-general] binarizer with more levels Good point. It should

Re: [Scikit-learn-general] getting different results with sklearn gridsearchCV

2014-09-12 Thread Andy
Hi Roberto. GridSearchCV uses accuracy for selection if no other method is specified, so there should be no difference. Could you provide code? Do you also create a pipeline when using your own grid search? I would imagine there is some difference in how you do the fitting in the pipeline.
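A small sketch of Andy's point, with the iris data and parameter grid as my own stand-ins (the import path assumes current scikit-learn; it was `sklearn.grid_search` at the time of this thread): with no `scoring` argument, `GridSearchCV` falls back to the classifier's own `score` method, which is plain unweighted accuracy, the same metric Roberto describes. Passing `scoring="accuracy"` just makes that explicit:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# scoring="accuracy" is equivalent to the default for classifiers:
# (number correctly classified) / (total number of samples), unweighted.
grid = GridSearchCV(LinearSVC(), {"C": [0.1, 1.0, 10.0]},
                    scoring="accuracy", cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```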

Re: [Scikit-learn-general] binarizer with more levels

2014-09-12 Thread Pagliari, Roberto
Thank you, I’m not seeing “sklearn.base”. Which module do I need to import to be able to use it? Thanks, From: Joel Nothman [mailto:[email protected]] Sent: Thursday, September 11, 2014 9:37 PM To: scikit-learn-general Subject: Re: [Scikit-learn-general] binarizer with more levels Good p

Re: [Scikit-learn-general] getting different results with sklearn gridsearchCV

2014-09-12 Thread Pagliari, Roberto
Regarding my previous question, I suspect the difference lies in the scoring function. What is the default scoring function used by gridsearch? In my own implementation I am using number of correctly classified samples (no weighting) / total number of samples sklearn gridsearch function must b

Re: [Scikit-learn-general] algorithm used to train the tree with option 'best'

2014-09-12 Thread Gilles Louppe
Hi Luca, The "best" strategy consists in finding the best threshold, that is the one that maximizes impurity decrease, when trying to partition a node into a left and right nodes. By contrast, "random" does not look for the best split and simply draw the discretization threshold at random. For fu

[Scikit-learn-general] getting different results with sklearn gridsearchCV

2014-09-12 Thread Pagliari, Roberto
I am comparing the results of sklearn cross-validation and my own cross validation. I tested linearSVC under the following conditions: - Data scaling per grid search - Data scaling + 2-level quantization, per grid search Specifically, I have done the following: Sklearn gridSe

[Scikit-learn-general] algorithm used to train the tree with option 'best'

2014-09-12 Thread Luca Puggini
Hi, I am using random forest classifier and this algorithm train a tree defined as : DecisionTreeClassifier(criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, random_state=198200

Re: [Scikit-learn-general] oob_score_ for random forests for regression

2014-09-12 Thread Arnaud Joly
Hi, The r2_score metric is used. Best regards, Arnaud On 12 Sep 2014, at 16:04, Josh Wasserstein wrote: > What error metric is used for this? > > Josh
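A minimal sketch of Arnaud's answer, on synthetic data of my own choosing: with `oob_score=True`, `RandomForestRegressor` exposes `oob_score_`, which is the R² (i.e. `r2_score`) of the out-of-bag predictions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

# Each tree is scored on the bootstrap samples it did NOT see;
# oob_score_ aggregates these out-of-bag predictions with r2_score.
forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)
print(forest.oob_score_)
```

The custom-metric request in the linked issue (#3455) is about making that hard-coded `r2_score` pluggable.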

[Scikit-learn-general] oob_score_ for random forests for regression

2014-09-12 Thread Josh Wasserstein
What error metric is used for this? Josh

Re: [Scikit-learn-general] SVC and unbalanced dataset

2014-09-12 Thread ZORAIDA HIDALGO SANCHEZ
Thanks for all the suggestions. I will try them and let you know. On 10/09/14 16:46, "Andy" wrote: >On 09/10/2014 09:07 AM, Gael Varoquaux wrote: >> How are you measuring your errors? If you are using the zero-one loss >> (accuracy score), you are taking in account only the binary decisions,