Re: [Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread Josh Vredevoogd
Another point of confusion: you shouldn't be using clf.predict() to calculate ROC AUC; you need clf.predict_proba(). ROC is a measure of ranking, and predict only gives you the predicted class, not the probability, so the ROC "curve" can only have points at 0 and 1 instead of any probability in between
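A minimal sketch of the difference (the toy data and estimator are assumed here, not taken from the original thread):
===
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# predict_proba ranks the test samples, which is what the ROC curve needs.
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
# predict collapses everything to 0/1, so the "curve" degenerates to a single point.
print(roc_auc_score(y_test, clf.predict(X_test)))
===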

Re: [Scikit-learn-general] DPGMM applied to 1-dimensional vector and variance problem

2016-05-12 Thread Andreas Mueller
Hi Johan. Unfortunately there are known problems with DPGMM: https://github.com/scikit-learn/scikit-learn/issues/2454 There is a PR to reimplement it: https://github.com/scikit-learn/scikit-learn/pull/4802 I didn't know about dpcluster; it seems unmaintained, but maybe something to compare

Re: [Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread Andreas Mueller
How did you evaluate on the development set? You should use "best_score_", not grid_search.score. On 05/12/2016 08:07 AM, A neuman wrote: That's actually what I did, and the difference is way too big. Should I do it without GridSearchCV? I'm just wondering why grid search is giving me overfitted
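A minimal sketch of the distinction (the data and parameter grid below are assumed, not the poster's code):
===
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, random_state=0)

grid_search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
grid_search.fit(X_dev, y_dev)

print(grid_search.best_score_)            # mean cross-validated score of the best params
print(grid_search.score(X_dev, y_dev))    # refit model scored on its own training data (optimistic)
print(grid_search.score(X_eval, y_eval))  # held-out evaluation score
===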

[Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Amita Misra
Hi, I have a limited dataset and hence want to learn the parameters and also evaluate the final model. From the documentation it looks like nested cross-validation is the way to do it. I have the code, but I still want to be sure that I am not overfitting in any way. pipeline=Pipeline([('scale',
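A sketch of nested cross-validation around a pipeline like the one quoted above (the toy data, step parameters, and grid are assumptions, since the original code is truncated here):
===
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, svm

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

pipeline = Pipeline([('scale', preprocessing.StandardScaler()),
                     ('filter', SelectKBest(f_regression)),
                     ('svr', svm.SVR())])
param_grid = {'filter__k': [5, 10], 'svr__C': [0.1, 1, 10, 100]}

# Inner loop tunes the hyperparameters; outer loop estimates generalization.
inner = GridSearchCV(pipeline, param_grid, cv=5)
scores = cross_val_score(inner, X, y, cv=5)
print(scores.mean(), scores.std())
===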

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Amita Misra
Actually, I do not have an independent test set, and hence I want to use nested CV as an estimate of generalization performance. My classifier is fixed (SVM), and I want to learn the parameters and also estimate an unbiased performance using only one set of data. I wanted to ensure that my code

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Алексей Драль
Hi Amita, As far as I understand your question, you only need one CV loop to optimize your objective with the scoring function provided: === pipeline=Pipeline([('scale', preprocessing.StandardScaler()),('filter', SelectKBest(f_regression)),('svr', svm.SVR())]) C_range = [0.1, 1, 10, 100]
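A sketch of that single-loop setup (the data, the rest of the grid, and the scorer are assumptions, since the original message is truncated here):
===
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn import preprocessing, svm

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

pipeline = Pipeline([('scale', preprocessing.StandardScaler()),
                     ('filter', SelectKBest(f_regression)),
                     ('svr', svm.SVR())])
C_range = [0.1, 1, 10, 100]

# One CV loop: tune the parameters against the chosen scoring function.
gs = GridSearchCV(pipeline,
                  {'svr__C': C_range, 'filter__k': [5, 10]},
                  scoring='neg_mean_squared_error', cv=5)
gs.fit(X, y)
print(gs.best_params_, gs.best_score_)
===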

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Sebastian Raschka
I would say there are 2 different applications of nested CV. You could use it for algorithm selection (with hyperparam tuning in the inner loop). Or, you could use it as an estimate of the generalization performance (only hyperparam tuning), which has been reported to be less biased than the a

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Sebastian Raschka
I am not that much into the multi-processing implementation in scikit-learn / joblib, but I think this could be one issue why your Mac hangs… I’d say that it’s probably the safest approach to only set the n_jobs parameter for the innermost object. E.g., if you have 4 processors, you said the
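A sketch of what setting n_jobs only on the innermost object could look like in the nested setup discussed in this thread (data and grid assumed):
===
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Parallelism only in the inner hyperparameter search ...
inner = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5, n_jobs=-1)
# ... while the outer loop stays sequential, to avoid nested process pools.
scores = cross_val_score(inner, X, y, cv=5, n_jobs=1)
print(scores)
===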

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Amita Misra
Oh yes, I get that now. All this while I was thinking there was an issue with the Mac, due to a similar issue discussed here: https://github.com/scikit-learn/scikit-learn/issues/5115. Thanks a lot for clearing this up. I am going to change the loop and see if I can run the parallel implementation

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Amita Misra
I had not thought about the n_jobs parameter, mainly because it does not work on my Mac; the system just hangs if I use it. The same code runs on a Linux server though. I have one more clarification to seek. I was running it on the server with this code. Would this be fine, or should I move the n_jobs=3

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Sebastian Raschka
You are welcome, and I am glad to hear that it works :). And “your” approach is definitely the cleaner way to do it … I think you just need to be a bit careful about the n_jobs parameter in practice; I would only set it to n_jobs=-1 in the inner loop. Best, Sebastian > On May 12, 2016, at

[Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread A neuman
Hello everyone, I'm having a bit of trouble with the parameters that I've got from GridSearchCV. For example: if I'm using the parameters I got from grid search CV, for example on RF or k-NN, and I test the model on the train set, I get an AUC value of about 1.00 or 0.99 every time. The

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Sebastian Raschka
I see; that’s what I thought. At first glance, the approach (code) looks correct to me, but I haven’t done it this way yet. Typically, I use a more “manual” approach, iterating over the outer folds myself (since I typically use nested CV for algo selection): gs_est = … your gridsearch,
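A sketch of that manual outer loop (the grid-search estimator gs_est is elided in the original, so a hypothetical one is substituted here, along with toy data):
===
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
gs_est = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)  # hypothetical inner search

outer_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    gs_est.fit(X[train_idx], y[train_idx])   # inner loop: tune on the outer training fold
    outer_scores.append(gs_est.score(X[test_idx], y[test_idx]))  # score on the outer test fold
print(np.mean(outer_scores), np.std(outer_scores))
===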

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Amita Misra
Thanks. Actually, there were 2 people running the same experiments, and the other person was doing it as you have shown above. We were getting the same results, but since the methods were different, I wanted to ensure that I am doing it the right way. Thanks, Amita On Thu, May 12, 2016 at 2:43 PM,

Re: [Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread Joel Nothman
This would be much clearer if you provided some code, but I think I get what you're saying. The final GridSearchCV model is trained on the full training set, so the fact that it perfectly fits that data with random forests is not altogether surprising. What you can say about the parameters is

Re: [Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread Olivier Grisel
2016-05-12 13:02 GMT+02:00 A neuman : > Thanks for the answer! > > but how should I check whether it's overfitted or not? Do a development / evaluation split of your dataset, for instance with the train_test_split utility first. Then train your GridSearchCV model on the
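A sketch of that development / evaluation split (the estimator, grid, and data are assumptions):
===
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, random_state=0)

gs = GridSearchCV(RandomForestClassifier(random_state=0),
                  {'max_depth': [3, 5, None]}, cv=5, scoring='roc_auc')
gs.fit(X_dev, y_dev)                      # fit and tune on the development split only

proba = gs.predict_proba(X_eval)[:, 1]    # evaluate on the untouched evaluation split
print(roc_auc_score(y_eval, proba))
===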

Re: [Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread A neuman
That's actually what I did, and the difference is way too big. Should I do it without GridSearchCV? I'm just wondering why grid search is giving me overfitted values. I know that these are the best params and so on... but I thought I could skip the manual part where I test the params on my own.