Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Amita Misra
Oh yes, I get that now. All this while I was thinking there was a problem
with the Mac due to a similar issue discussed here:
https://github.com/scikit-learn/scikit-learn/issues/5115.

Thanks a lot for clearing this up. I am going to change the loop and see
if I can run the parallel implementation on the Mac.
It was probably running on the server since it has many more processors.

Thanks,
Amita

On Thu, May 12, 2016 at 7:35 PM, Sebastian Raschka 
wrote:

> I am not that much into the multi-processing implementation in
> scikit-learn / joblib, but I think this could be one issue why your mac
> hangs… I’d say that it’s probably the safest approach to only set the
> n_jobs parameter for the innermost object.
>
> E.g., if you have 4 processors and you set the GridSearch to 2 and a k-fold
> loop to, e.g., 5, I can imagine that it would blow up because you are
> suddenly trying to run 10 processes on 4 processors, if that makes sense!?
>
>
> > On May 12, 2016, at 10:26 PM, Amita Misra  wrote:
> >
> > I had not thought about the n_jobs parameter, mainly because it does not
> run on my mac and the system just hangs if i use it.
> > The same code runs on linux server though.
> >
> > I have one more clarification to seek.
> > I was running it on server with this code. Would this be fine or may I
> move the n_jobs=3 to GridSearchCV
> >
> > grid_search = GridSearchCV(pipeline,
> param_grid=param_grid,scoring=scoringcriteria,cv=5)
> > scores = cross_validation.cross_val_score(grid_search, X_train,
> Y_train,cv=cvfolds,n_jobs=3)
> >
> > Thanks,
> > Amita
> >
> > On Thu, May 12, 2016 at 6:58 PM, Sebastian Raschka 
> wrote:
> > You are welcome, and I am glad to hear that it works :). And “your"
> approach is definitely the cleaner way to do it … I think you just need to
> be a bit careful about the n_jobs parameter in practice, I would only set
> it to n_jobs=-1 in the inner loop.
> >
> > Best,
> > Sebastian
> >
> >
> > > On May 12, 2016, at 7:17 PM, Amita Misra  wrote:
> > >
> > > Thanks.
> > > Actually there were 2 people running the same experiments and the
> other person was doing as you have shown above.
> > > We were getting the same results but since methods were different I
> wanted to ensure that I am doing it the right way.
> > >
> > > Thanks,
> > > Amita
> > >
> > > On Thu, May 12, 2016 at 2:43 PM, Sebastian Raschka <
> se.rasc...@gmail.com> wrote:
> > > I see; that’s what I thought. At first glance, the approach (code)
> looks correct to me but I haven’ t done it this way, yet. Typically, I use
> a more “manual” approach iterating over the outer folds manually (since I
> typically use nested CV for algo selection):
> > >
> > >
> > > gs_est = … your gridsearch, pipeline, estimator with param grid and
> cv=5
> > > skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True,
> random_state=123)
> > >
> > > for outer_train_idx, outer_valid_idx in skfold:
> > > gs_est.fit(X_train[outer_train_idx], y_train[outer_train_idx])
> > > y_pred = gs_est.predict(X_train[outer_valid_idx])
> > > acc = accuracy_score(y_true=y_train[outer_valid_idx],
> y_pred=y_pred)
> > > print(' | inner ACC %.2f%% | outer ACC %.2f%%' %
> (gs_est.best_score_ * 100, acc * 100))
> > > cv_scores[name].append(acc)
> > >
> > > However, it should essentially do the same thing as your code if I see
> it correctly.
> > >
> > >
> > > > On May 12, 2016, at 4:50 PM, Amita Misra  wrote:
> > > >
> > > > Actually I do not have an independent test set and hence I want to
> use it as an estimate for generalization performance. Hence my classifier
> is fixed SVM and I want to learn the parameters and also estimate an
> unbiased performance using only one set of data.
> > > >
> > > > I wanted to ensure that my code correctly does a nested 10*5 CV and
> the parameters are learnt on a different set and final evaluation to get
> the predicted score is on a different set.
> > > >
> > > > Amita
> > > >
> > > >
> > > >
> > > > On Thu, May 12, 2016 at 1:24 PM, Sebastian Raschka <
> se.rasc...@gmail.com> wrote:
> > > > I would say there are 2 different applications of nested CV. You
> could use it for algorithm selection (with hyperparam tuning in the inner
> loop). Or, you could use it as an estimate of the generalization
> performance (only hyperparam tuning), which has been reported to be less
> biased than the a k-fold CV estimate (Varma, S., & Simon, R. (2006). Bias
> in error estimation when using cross-validation for model selection. BMC
> Bioinformatics, 7, 91. http://doi.org/10.1186/1471-2105-7-91)
> > > >
> > > > By  "you could use it as an estimate of the generalization
> performance (only hyperparam tuning)” I mean as a replacement for k-fold on
> the training set and evaluation on an independent test set.
> > > >
> > > > > On May 12, 2016, at 4:16 PM, Алексей Драль 
> wrote:
> > > > >
> > > > > Hi Amita,
> > 

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Sebastian Raschka
I am not that much into the multiprocessing implementation in scikit-learn /
joblib, but I think this could be one reason why your Mac hangs… I’d say that
it’s probably the safest approach to only set the n_jobs parameter on the
innermost object.

E.g., if you have 4 processors and you set the GridSearch to 2 and a k-fold
loop to, e.g., 5, I can imagine that it would blow up because you are suddenly
trying to run 10 processes on 4 processors, if that makes sense!?
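
A minimal, runnable sketch of that arrangement, assuming the 0.17-era module
layout used throughout this thread (sklearn.grid_search and
sklearn.cross_validation; newer releases moved both into
sklearn.model_selection). The dataset, pipeline and grid are illustrative
placeholders, not the original code:

import numpy
from sklearn import svm, preprocessing
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

pipeline = Pipeline([('scale', preprocessing.StandardScaler()),
                     ('filter', SelectKBest(f_regression)),
                     ('svr', svm.SVR())])
param_grid = [{'svr__kernel': ['rbf'],
               'svr__gamma': numpy.logspace(-2, 2, 5),
               'svr__C': [0.1, 1, 10, 100]}]

# Parallelism only on the innermost object: the grid search fans its candidate
# fits out to all cores, while the outer CV loop stays sequential.
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, n_jobs=-1)
scores = cross_val_score(grid_search, X, y, cv=10, n_jobs=1)
print(scores.mean())

# Setting n_jobs on both levels instead (e.g. 2 in the grid search and 5 in
# the outer loop) can spawn roughly 2 * 5 worker processes at once, which
# easily oversubscribes a 4-core machine.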


> On May 12, 2016, at 10:26 PM, Amita Misra  wrote:
> 
> I had not thought about the n_jobs parameter, mainly because it does not run 
> on my mac and the system just hangs if i use it.
> The same code runs on linux server though.
> 
> I have one more clarification to seek.
> I was running it on server with this code. Would this be fine or may I move 
> the n_jobs=3 to GridSearchCV
> 
> grid_search = GridSearchCV(pipeline, 
> param_grid=param_grid,scoring=scoringcriteria,cv=5) 
> scores = cross_validation.cross_val_score(grid_search, X_train, 
> Y_train,cv=cvfolds,n_jobs=3)
> 
> Thanks,
> Amita
> 
> On Thu, May 12, 2016 at 6:58 PM, Sebastian Raschka  
> wrote:
> You are welcome, and I am glad to hear that it works :). And “your" approach 
> is definitely the cleaner way to do it … I think you just need to be a bit 
> careful about the n_jobs parameter in practice, I would only set it to 
> n_jobs=-1 in the inner loop.
> 
> Best,
> Sebastian
> 
> 
> > On May 12, 2016, at 7:17 PM, Amita Misra  wrote:
> >
> > Thanks.
> > Actually there were 2 people running the same experiments and the other 
> > person was doing as you have shown above.
> > We were getting the same results but since methods were different I  wanted 
> > to ensure that I am doing it the right way.
> >
> > Thanks,
> > Amita
> >
> > On Thu, May 12, 2016 at 2:43 PM, Sebastian Raschka  
> > wrote:
> > I see; that’s what I thought. At first glance, the approach (code) looks 
> > correct to me but I haven’ t done it this way, yet. Typically, I use a more 
> > “manual” approach iterating over the outer folds manually (since I 
> > typically use nested CV for algo selection):
> >
> >
> > gs_est = … your gridsearch, pipeline, estimator with param grid and cv=5
> > skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True, 
> > random_state=123)
> >
> > for outer_train_idx, outer_valid_idx in skfold:
> > gs_est.fit(X_train[outer_train_idx], y_train[outer_train_idx])
> > y_pred = gs_est.predict(X_train[outer_valid_idx])
> > acc = accuracy_score(y_true=y_train[outer_valid_idx], 
> > y_pred=y_pred)
> > print(' | inner ACC %.2f%% | outer ACC %.2f%%' % 
> > (gs_est.best_score_ * 100, acc * 100))
> > cv_scores[name].append(acc)
> >
> > However, it should essentially do the same thing as your code if I see it 
> > correctly.
> >
> >
> > > On May 12, 2016, at 4:50 PM, Amita Misra  wrote:
> > >
> > > Actually I do not have an independent test set and hence I want to use it 
> > > as an estimate for generalization performance. Hence my classifier is 
> > > fixed SVM and I want to learn the parameters and also estimate an 
> > > unbiased performance using only one set of data.
> > >
> > > I wanted to ensure that my code correctly does a nested 10*5 CV and the 
> > > parameters are learnt on a different set and final evaluation to get the 
> > > predicted score is on a different set.
> > >
> > > Amita
> > >
> > >
> > >
> > > On Thu, May 12, 2016 at 1:24 PM, Sebastian Raschka  
> > > wrote:
> > > I would say there are 2 different applications of nested CV. You could 
> > > use it for algorithm selection (with hyperparam tuning in the inner 
> > > loop). Or, you could use it as an estimate of the generalization 
> > > performance (only hyperparam tuning), which has been reported to be less 
> > > biased than the a k-fold CV estimate (Varma, S., & Simon, R. (2006). Bias 
> > > in error estimation when using cross-validation for model selection. BMC 
> > > Bioinformatics, 7, 91. http://doi.org/10.1186/1471-2105-7-91)
> > >
> > > By  "you could use it as an estimate of the generalization performance 
> > > (only hyperparam tuning)” I mean as a replacement for k-fold on the 
> > > training set and evaluation on an independent test set.
> > >
> > > > On May 12, 2016, at 4:16 PM, Алексей Драль  wrote:
> > > >
> > > > Hi Amita,
> > > >
> > > > As far as I understand your question, you only need one CV loop to 
> > > > optimize your objective with scoring function provided:
> > > >
> > > > ===
> > > > pipeline=Pipeline([('scale', preprocessing.StandardScaler()),('filter', 
> > > > SelectKBest(f_regression)),('svr', svm.SVR())]
> > > > C_range = [0.1, 1, 10, 100]
> > > > gamma_range=numpy.logspace(-2, 2, 5)
> > > > param_grid=[{'svr__kernel': ['rbf'], 'svr__gamma': 
> > > > gamma_range,'svr__C': C_range}]
> > > > grid_search = 

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Amita Misra
I had not thought about the n_jobs parameter, mainly because it does not
run on my Mac and the system just hangs if I use it.
The same code runs on the Linux server though.

I have one more clarification to seek.
I was running it on the server with this code. Would this be fine, or should I
move n_jobs=3 to GridSearchCV?

grid_search = GridSearchCV(pipeline, param_grid=param_grid,
                           scoring=scoringcriteria, cv=5)
scores = cross_validation.cross_val_score(grid_search, X_train, Y_train,
                                          cv=cvfolds, n_jobs=3)

Thanks,
Amita

On Thu, May 12, 2016 at 6:58 PM, Sebastian Raschka 
wrote:

> You are welcome, and I am glad to hear that it works :). And “your"
> approach is definitely the cleaner way to do it … I think you just need to
> be a bit careful about the n_jobs parameter in practice, I would only set
> it to n_jobs=-1 in the inner loop.
>
> Best,
> Sebastian
>
>
> > On May 12, 2016, at 7:17 PM, Amita Misra  wrote:
> >
> > Thanks.
> > Actually there were 2 people running the same experiments and the other
> person was doing as you have shown above.
> > We were getting the same results but since methods were different I
> wanted to ensure that I am doing it the right way.
> >
> > Thanks,
> > Amita
> >
> > On Thu, May 12, 2016 at 2:43 PM, Sebastian Raschka 
> wrote:
> > I see; that’s what I thought. At first glance, the approach (code) looks
> correct to me but I haven’ t done it this way, yet. Typically, I use a more
> “manual” approach iterating over the outer folds manually (since I
> typically use nested CV for algo selection):
> >
> >
> > gs_est = … your gridsearch, pipeline, estimator with param grid and cv=5
> > skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True,
> random_state=123)
> >
> > for outer_train_idx, outer_valid_idx in skfold:
> > gs_est.fit(X_train[outer_train_idx], y_train[outer_train_idx])
> > y_pred = gs_est.predict(X_train[outer_valid_idx])
> > acc = accuracy_score(y_true=y_train[outer_valid_idx],
> y_pred=y_pred)
> > print(' | inner ACC %.2f%% | outer ACC %.2f%%' %
> (gs_est.best_score_ * 100, acc * 100))
> > cv_scores[name].append(acc)
> >
> > However, it should essentially do the same thing as your code if I see
> it correctly.
> >
> >
> > > On May 12, 2016, at 4:50 PM, Amita Misra  wrote:
> > >
> > > Actually I do not have an independent test set and hence I want to use
> it as an estimate for generalization performance. Hence my classifier is
> fixed SVM and I want to learn the parameters and also estimate an unbiased
> performance using only one set of data.
> > >
> > > I wanted to ensure that my code correctly does a nested 10*5 CV and
> the parameters are learnt on a different set and final evaluation to get
> the predicted score is on a different set.
> > >
> > > Amita
> > >
> > >
> > >
> > > On Thu, May 12, 2016 at 1:24 PM, Sebastian Raschka <
> se.rasc...@gmail.com> wrote:
> > > I would say there are 2 different applications of nested CV. You could
> use it for algorithm selection (with hyperparam tuning in the inner loop).
> Or, you could use it as an estimate of the generalization performance (only
> hyperparam tuning), which has been reported to be less biased than the a
> k-fold CV estimate (Varma, S., & Simon, R. (2006). Bias in error estimation
> when using cross-validation for model selection. BMC Bioinformatics, 7, 91.
> http://doi.org/10.1186/1471-2105-7-91)
> > >
> > > By  "you could use it as an estimate of the generalization performance
> (only hyperparam tuning)” I mean as a replacement for k-fold on the
> training set and evaluation on an independent test set.
> > >
> > > > On May 12, 2016, at 4:16 PM, Алексей Драль  wrote:
> > > >
> > > > Hi Amita,
> > > >
> > > > As far as I understand your question, you only need one CV loop to
> optimize your objective with scoring function provided:
> > > >
> > > > ===
> > > > pipeline=Pipeline([('scale',
> preprocessing.StandardScaler()),('filter',
> SelectKBest(f_regression)),('svr', svm.SVR())]
> > > > C_range = [0.1, 1, 10, 100]
> > > > gamma_range=numpy.logspace(-2, 2, 5)
> > > > param_grid=[{'svr__kernel': ['rbf'], 'svr__gamma':
> gamma_range,'svr__C': C_range}]
> > > > grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5,
> scoring=scoring_function)
> > > > grid_search.fit(X_train, Y_train)
> > > > ===
> > > >
> > > > More details about it you should be able to find in documentation:
> > > >   •
> http://scikit-learn.org/stable/modules/grid_search.html#grid-search
> > > >   •
> http://scikit-learn.org/stable/modules/grid_search.html#gridsearch-scoring
> > > >
> > > > 2016-05-12 17:05 GMT+01:00 Amita Misra :
> > > > Hi,
> > > >
> > > > I have a limited dataset and hence want  to learn the parameters and
> also evaluate the final model.
> > > > From the documents it looks that nested cross validation is the way
> to do it. I have the code 

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Sebastian Raschka
You are welcome, and I am glad to hear that it works :). And “your” approach is
definitely the cleaner way to do it … I think you just need to be a bit careful
about the n_jobs parameter in practice; I would only set it to n_jobs=-1 in the
inner loop.

Best,
Sebastian


> On May 12, 2016, at 7:17 PM, Amita Misra  wrote:
> 
> Thanks.
> Actually there were 2 people running the same experiments and the other 
> person was doing as you have shown above. 
> We were getting the same results but since methods were different I  wanted 
> to ensure that I am doing it the right way.
> 
> Thanks,
> Amita
> 
> On Thu, May 12, 2016 at 2:43 PM, Sebastian Raschka  
> wrote:
> I see; that’s what I thought. At first glance, the approach (code) looks 
> correct to me but I haven’ t done it this way, yet. Typically, I use a more 
> “manual” approach iterating over the outer folds manually (since I typically 
> use nested CV for algo selection):
> 
> 
> gs_est = … your gridsearch, pipeline, estimator with param grid and cv=5
> skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True, random_state=123)
> 
> for outer_train_idx, outer_valid_idx in skfold:
> gs_est.fit(X_train[outer_train_idx], y_train[outer_train_idx])
> y_pred = gs_est.predict(X_train[outer_valid_idx])
> acc = accuracy_score(y_true=y_train[outer_valid_idx], 
> y_pred=y_pred)
> print(' | inner ACC %.2f%% | outer ACC %.2f%%' % 
> (gs_est.best_score_ * 100, acc * 100))
> cv_scores[name].append(acc)
> 
> However, it should essentially do the same thing as your code if I see it 
> correctly.
> 
> 
> > On May 12, 2016, at 4:50 PM, Amita Misra  wrote:
> >
> > Actually I do not have an independent test set and hence I want to use it 
> > as an estimate for generalization performance. Hence my classifier is fixed 
> > SVM and I want to learn the parameters and also estimate an unbiased 
> > performance using only one set of data.
> >
> > I wanted to ensure that my code correctly does a nested 10*5 CV and the 
> > parameters are learnt on a different set and final evaluation to get the 
> > predicted score is on a different set.
> >
> > Amita
> >
> >
> >
> > On Thu, May 12, 2016 at 1:24 PM, Sebastian Raschka  
> > wrote:
> > I would say there are 2 different applications of nested CV. You could use 
> > it for algorithm selection (with hyperparam tuning in the inner loop). Or, 
> > you could use it as an estimate of the generalization performance (only 
> > hyperparam tuning), which has been reported to be less biased than the a 
> > k-fold CV estimate (Varma, S., & Simon, R. (2006). Bias in error estimation 
> > when using cross-validation for model selection. BMC Bioinformatics, 7, 91. 
> > http://doi.org/10.1186/1471-2105-7-91)
> >
> > By  "you could use it as an estimate of the generalization performance 
> > (only hyperparam tuning)” I mean as a replacement for k-fold on the 
> > training set and evaluation on an independent test set.
> >
> > > On May 12, 2016, at 4:16 PM, Алексей Драль  wrote:
> > >
> > > Hi Amita,
> > >
> > > As far as I understand your question, you only need one CV loop to 
> > > optimize your objective with scoring function provided:
> > >
> > > ===
> > > pipeline=Pipeline([('scale', preprocessing.StandardScaler()),('filter', 
> > > SelectKBest(f_regression)),('svr', svm.SVR())]
> > > C_range = [0.1, 1, 10, 100]
> > > gamma_range=numpy.logspace(-2, 2, 5)
> > > param_grid=[{'svr__kernel': ['rbf'], 'svr__gamma': gamma_range,'svr__C': 
> > > C_range}]
> > > grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, 
> > > scoring=scoring_function)
> > > grid_search.fit(X_train, Y_train)
> > > ===
> > >
> > > More details about it you should be able to find in documentation:
> > >   • 
> > > http://scikit-learn.org/stable/modules/grid_search.html#grid-search
> > >   • 
> > > http://scikit-learn.org/stable/modules/grid_search.html#gridsearch-scoring
> > >
> > > 2016-05-12 17:05 GMT+01:00 Amita Misra :
> > > Hi,
> > >
> > > I have a limited dataset and hence want  to learn the parameters and also 
> > > evaluate the final model.
> > > From the documents it looks that nested cross validation is the way to do 
> > > it. I have the code but still I want to be sure that I am not overfitting 
> > > any way.
> > >
> > > pipeline=Pipeline([('scale', preprocessing.StandardScaler()),('filter', 
> > > SelectKBest(f_regression)),('svr', svm.SVR())]
> > > C_range = [0.1, 1, 10, 100]
> > > gamma_range=numpy.logspace(-2, 2, 5)
> > > param_grid=[{'svr__kernel': ['rbf'], 'svr__gamma': gamma_range,'svr__C': 
> > > C_range}]
> > > grid_search = GridSearchCV(pipeline, param_grid=param_grid,cv=5) 
> > > Y_pred=cross_validation.cross_val_predict(grid_search, X_train, 
> > > Y_train,cv=10)
> > >
> > > correlation=  numpy.ma.corrcoef(Y_train,Y_pred)[0, 1]
> > >
> > >
> > > please let 

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Amita Misra
Thanks.
Actually there were 2 people running the same experiments, and the other
person was doing it as you have shown above.
We were getting the same results, but since the methods were different I wanted
to ensure that I am doing it the right way.

Thanks,
Amita

On Thu, May 12, 2016 at 2:43 PM, Sebastian Raschka 
wrote:

> I see; that’s what I thought. At first glance, the approach (code) looks
> correct to me but I haven’ t done it this way, yet. Typically, I use a more
> “manual” approach iterating over the outer folds manually (since I
> typically use nested CV for algo selection):
>
>
> gs_est = … your gridsearch, pipeline, estimator with param grid and cv=5
> skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True,
> random_state=123)
>
> for outer_train_idx, outer_valid_idx in skfold:
> gs_est.fit(X_train[outer_train_idx], y_train[outer_train_idx])
> y_pred = gs_est.predict(X_train[outer_valid_idx])
> acc = accuracy_score(y_true=y_train[outer_valid_idx],
> y_pred=y_pred)
> print(' | inner ACC %.2f%% | outer ACC %.2f%%' %
> (gs_est.best_score_ * 100, acc * 100))
> cv_scores[name].append(acc)
>
> However, it should essentially do the same thing as your code if I see it
> correctly.
>
>
> > On May 12, 2016, at 4:50 PM, Amita Misra  wrote:
> >
> > Actually I do not have an independent test set and hence I want to use
> it as an estimate for generalization performance. Hence my classifier is
> fixed SVM and I want to learn the parameters and also estimate an unbiased
> performance using only one set of data.
> >
> > I wanted to ensure that my code correctly does a nested 10*5 CV and the
> parameters are learnt on a different set and final evaluation to get the
> predicted score is on a different set.
> >
> > Amita
> >
> >
> >
> > On Thu, May 12, 2016 at 1:24 PM, Sebastian Raschka 
> wrote:
> > I would say there are 2 different applications of nested CV. You could
> use it for algorithm selection (with hyperparam tuning in the inner loop).
> Or, you could use it as an estimate of the generalization performance (only
> hyperparam tuning), which has been reported to be less biased than the a
> k-fold CV estimate (Varma, S., & Simon, R. (2006). Bias in error estimation
> when using cross-validation for model selection. BMC Bioinformatics, 7, 91.
> http://doi.org/10.1186/1471-2105-7-91)
> >
> > By  "you could use it as an estimate of the generalization performance
> (only hyperparam tuning)” I mean as a replacement for k-fold on the
> training set and evaluation on an independent test set.
> >
> > > On May 12, 2016, at 4:16 PM, Алексей Драль  wrote:
> > >
> > > Hi Amita,
> > >
> > > As far as I understand your question, you only need one CV loop to
> optimize your objective with scoring function provided:
> > >
> > > ===
> > > pipeline=Pipeline([('scale',
> preprocessing.StandardScaler()),('filter',
> SelectKBest(f_regression)),('svr', svm.SVR())]
> > > C_range = [0.1, 1, 10, 100]
> > > gamma_range=numpy.logspace(-2, 2, 5)
> > > param_grid=[{'svr__kernel': ['rbf'], 'svr__gamma':
> gamma_range,'svr__C': C_range}]
> > > grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5,
> scoring=scoring_function)
> > > grid_search.fit(X_train, Y_train)
> > > ===
> > >
> > > More details about it you should be able to find in documentation:
> > >   •
> http://scikit-learn.org/stable/modules/grid_search.html#grid-search
> > >   •
> http://scikit-learn.org/stable/modules/grid_search.html#gridsearch-scoring
> > >
> > > 2016-05-12 17:05 GMT+01:00 Amita Misra :
> > > Hi,
> > >
> > > I have a limited dataset and hence want  to learn the parameters and
> also evaluate the final model.
> > > From the documents it looks that nested cross validation is the way to
> do it. I have the code but still I want to be sure that I am not
> overfitting any way.
> > >
> > > pipeline=Pipeline([('scale',
> preprocessing.StandardScaler()),('filter',
> SelectKBest(f_regression)),('svr', svm.SVR())]
> > > C_range = [0.1, 1, 10, 100]
> > > gamma_range=numpy.logspace(-2, 2, 5)
> > > param_grid=[{'svr__kernel': ['rbf'], 'svr__gamma':
> gamma_range,'svr__C': C_range}]
> > > grid_search = GridSearchCV(pipeline, param_grid=param_grid,cv=5)
> Y_pred=cross_validation.cross_val_predict(grid_search, X_train,
> Y_train,cv=10)
> > >
> > > correlation=  numpy.ma.corrcoef(Y_train,Y_pred)[0, 1]
> > >
> > >
> > > please let me know if my understanding is correct.
> > >
> > > This is 10*5 nested cross validation. Inner folds CV over training
> data involves a grid search over hyperparameters and outer folds evaluate
> the performance.
> > >
> > >
> > >
> > > Thanks,
> > > Amita--
> > > Amita Misra
> > > Graduate Student Researcher
> > > Natural Language and Dialogue Systems Lab
> > > Baskin School of Engineering
> > > University of California Santa Cruz
> > >
> > >
> > >
> 

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Sebastian Raschka
I see; that’s what I thought. At first glance, the approach (code) looks
correct to me, but I haven’t done it this way yet. Typically, I use a more
“manual” approach, iterating over the outer folds myself (since I typically
use nested CV for algorithm selection):


gs_est = … your gridsearch, pipeline, estimator with param grid and cv=5
skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True, random_state=123)

for outer_train_idx, outer_valid_idx in skfold:
    gs_est.fit(X_train[outer_train_idx], y_train[outer_train_idx])
    y_pred = gs_est.predict(X_train[outer_valid_idx])
    acc = accuracy_score(y_true=y_train[outer_valid_idx], y_pred=y_pred)
    print(' | inner ACC %.2f%% | outer ACC %.2f%%' %
          (gs_est.best_score_ * 100, acc * 100))
    cv_scores[name].append(acc)

However, it should essentially do the same thing as your code if I see it 
correctly.
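
For reference, a self-contained version of that sketch, with illustrative data,
estimator and grid filled in (the gs_est line above is only a placeholder). It
assumes the 0.17-era API used in this thread, where StratifiedKFold takes y and
n_folds and yields (train, valid) index pairs:

from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import accuracy_score

X_train, y_train = make_classification(n_samples=300, random_state=0)

gs_est = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=5)
skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True, random_state=123)

cv_scores = {'svc': []}
for outer_train_idx, outer_valid_idx in skfold:
    # inner loop: 5-fold grid search on the outer training portion
    gs_est.fit(X_train[outer_train_idx], y_train[outer_train_idx])
    # outer loop: score the tuned model on the held-out outer fold
    y_pred = gs_est.predict(X_train[outer_valid_idx])
    acc = accuracy_score(y_true=y_train[outer_valid_idx], y_pred=y_pred)
    print(' | inner ACC %.2f%% | outer ACC %.2f%%' %
          (gs_est.best_score_ * 100, acc * 100))
    cv_scores['svc'].append(acc)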


> On May 12, 2016, at 4:50 PM, Amita Misra  wrote:
> 
> Actually I do not have an independent test set and hence I want to use it as 
> an estimate for generalization performance. Hence my classifier is fixed SVM 
> and I want to learn the parameters and also estimate an unbiased performance 
> using only one set of data.
> 
> I wanted to ensure that my code correctly does a nested 10*5 CV and the 
> parameters are learnt on a different set and final evaluation to get the 
> predicted score is on a different set.
> 
> Amita
> 
> 
> 
> On Thu, May 12, 2016 at 1:24 PM, Sebastian Raschka  
> wrote:
> I would say there are 2 different applications of nested CV. You could use it 
> for algorithm selection (with hyperparam tuning in the inner loop). Or, you 
> could use it as an estimate of the generalization performance (only 
> hyperparam tuning), which has been reported to be less biased than the a 
> k-fold CV estimate (Varma, S., & Simon, R. (2006). Bias in error estimation 
> when using cross-validation for model selection. BMC Bioinformatics, 7, 91. 
> http://doi.org/10.1186/1471-2105-7-91)
> 
> By  "you could use it as an estimate of the generalization performance (only 
> hyperparam tuning)” I mean as a replacement for k-fold on the training set 
> and evaluation on an independent test set.
> 
> > On May 12, 2016, at 4:16 PM, Алексей Драль  wrote:
> >
> > Hi Amita,
> >
> > As far as I understand your question, you only need one CV loop to optimize 
> > your objective with scoring function provided:
> >
> > ===
> > pipeline=Pipeline([('scale', preprocessing.StandardScaler()),('filter', 
> > SelectKBest(f_regression)),('svr', svm.SVR())]
> > C_range = [0.1, 1, 10, 100]
> > gamma_range=numpy.logspace(-2, 2, 5)
> > param_grid=[{'svr__kernel': ['rbf'], 'svr__gamma': gamma_range,'svr__C': 
> > C_range}]
> > grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, 
> > scoring=scoring_function)
> > grid_search.fit(X_train, Y_train)
> > ===
> >
> > More details about it you should be able to find in documentation:
> >   • http://scikit-learn.org/stable/modules/grid_search.html#grid-search
> >   • 
> > http://scikit-learn.org/stable/modules/grid_search.html#gridsearch-scoring
> >
> > 2016-05-12 17:05 GMT+01:00 Amita Misra :
> > Hi,
> >
> > I have a limited dataset and hence want  to learn the parameters and also 
> > evaluate the final model.
> > From the documents it looks that nested cross validation is the way to do 
> > it. I have the code but still I want to be sure that I am not overfitting 
> > any way.
> >
> > pipeline=Pipeline([('scale', preprocessing.StandardScaler()),('filter', 
> > SelectKBest(f_regression)),('svr', svm.SVR())]
> > C_range = [0.1, 1, 10, 100]
> > gamma_range=numpy.logspace(-2, 2, 5)
> > param_grid=[{'svr__kernel': ['rbf'], 'svr__gamma': gamma_range,'svr__C': 
> > C_range}]
> > grid_search = GridSearchCV(pipeline, param_grid=param_grid,cv=5) 
> > Y_pred=cross_validation.cross_val_predict(grid_search, X_train, 
> > Y_train,cv=10)
> >
> > correlation=  numpy.ma.corrcoef(Y_train,Y_pred)[0, 1]
> >
> >
> > please let me know if my understanding is correct.
> >
> > This is 10*5 nested cross validation. Inner folds CV over training data 
> > involves a grid search over hyperparameters and outer folds evaluate the 
> > performance.
> >
> >
> >
> > Thanks,
> > Amita--
> > Amita Misra
> > Graduate Student Researcher
> > Natural Language and Dialogue Systems Lab
> > Baskin School of Engineering
> > University of California Santa Cruz
> >
> >

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Amita Misra
Actually I do not have an independent test set, and hence I want to use nested
CV as an estimate of generalization performance. My classifier is fixed (SVM),
and I want to learn the parameters and also estimate an unbiased performance
using only one set of data.

I wanted to ensure that my code correctly does a nested 10x5 CV, that the
parameters are learnt on one set, and that the final evaluation to get the
predicted score is on a different set.

Amita



On Thu, May 12, 2016 at 1:24 PM, Sebastian Raschka 
wrote:

> I would say there are 2 different applications of nested CV. You could use
> it for algorithm selection (with hyperparam tuning in the inner loop). Or,
> you could use it as an estimate of the generalization performance (only
> hyperparam tuning), which has been reported to be less biased than the a
> k-fold CV estimate (Varma, S., & Simon, R. (2006). Bias in error estimation
> when using cross-validation for model selection. BMC Bioinformatics, 7, 91.
> http://doi.org/10.1186/1471-2105-7-91)
>
> By  "you could use it as an estimate of the generalization performance
> (only hyperparam tuning)” I mean as a replacement for k-fold on the
> training set and evaluation on an independent test set.
>
> > On May 12, 2016, at 4:16 PM, Алексей Драль  wrote:
> >
> > Hi Amita,
> >
> > As far as I understand your question, you only need one CV loop to
> optimize your objective with scoring function provided:
> >
> > ===
> > pipeline=Pipeline([('scale', preprocessing.StandardScaler()),('filter',
> SelectKBest(f_regression)),('svr', svm.SVR())]
> > C_range = [0.1, 1, 10, 100]
> > gamma_range=numpy.logspace(-2, 2, 5)
> > param_grid=[{'svr__kernel': ['rbf'], 'svr__gamma': gamma_range,'svr__C':
> C_range}]
> > grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5,
> scoring=scoring_function)
> > grid_search.fit(X_train, Y_train)
> > ===
> >
> > More details about it you should be able to find in documentation:
> >   •
> http://scikit-learn.org/stable/modules/grid_search.html#grid-search
> >   •
> http://scikit-learn.org/stable/modules/grid_search.html#gridsearch-scoring
> >
> > 2016-05-12 17:05 GMT+01:00 Amita Misra :
> > Hi,
> >
> > I have a limited dataset and hence want  to learn the parameters and
> also evaluate the final model.
> > From the documents it looks that nested cross validation is the way to
> do it. I have the code but still I want to be sure that I am not
> overfitting any way.
> >
> > pipeline=Pipeline([('scale', preprocessing.StandardScaler()),('filter',
> SelectKBest(f_regression)),('svr', svm.SVR())]
> > C_range = [0.1, 1, 10, 100]
> > gamma_range=numpy.logspace(-2, 2, 5)
> > param_grid=[{'svr__kernel': ['rbf'], 'svr__gamma': gamma_range,'svr__C':
> C_range}]
> > grid_search = GridSearchCV(pipeline, param_grid=param_grid,cv=5)
> Y_pred=cross_validation.cross_val_predict(grid_search, X_train,
> Y_train,cv=10)
> >
> > correlation=  numpy.ma.corrcoef(Y_train,Y_pred)[0, 1]
> >
> >
> > please let me know if my understanding is correct.
> >
> > This is 10*5 nested cross validation. Inner folds CV over training data
> involves a grid search over hyperparameters and outer folds evaluate the
> performance.
> >
> >
> >
> > Thanks,
> > Amita--
> > Amita Misra
> > Graduate Student Researcher
> > Natural Language and Dialogue Systems Lab
> > Baskin School of Engineering
> > University of California Santa Cruz
> >
> >
> >
> >
> > --
> > Yours sincerely,
> > Alexey A. Dral
>

Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Sebastian Raschka
I would say there are 2 different applications of nested CV. You could use it
for algorithm selection (with hyperparam tuning in the inner loop). Or, you
could use it as an estimate of the generalization performance (only hyperparam
tuning), which has been reported to be less biased than a k-fold CV
estimate (Varma, S., & Simon, R. (2006). Bias in error estimation when using
cross-validation for model selection. BMC Bioinformatics, 7, 91.
http://doi.org/10.1186/1471-2105-7-91).

By “you could use it as an estimate of the generalization performance (only
hyperparam tuning)” I mean as a replacement for k-fold on the training set
plus evaluation on an independent test set.
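
A hedged sketch of the algorithm-selection variant: each candidate gets its own
inner grid search, and the outer folds compare the tuned candidates. The
estimators, grids and data are illustrative only; imports follow the 0.17-era
layout used in this thread:

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Each candidate algorithm carries its own inner 5-fold grid search.
candidates = {
    'svc': GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5),
    'rf': GridSearchCV(RandomForestClassifier(random_state=1),
                       {'max_depth': [3, 5, None]}, cv=5),
}

# The outer 10-fold loop estimates the generalization accuracy of each tuned
# candidate; the winner would then be refit on all of the data.
for name, gs in candidates.items():
    scores = cross_val_score(gs, X, y, cv=10)
    print('%s: %.3f +/- %.3f' % (name, scores.mean(), scores.std()))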

> On May 12, 2016, at 4:16 PM, Алексей Драль  wrote:
> 
> Hi Amita,
> 
> As far as I understand your question, you only need one CV loop to optimize 
> your objective with scoring function provided:
> 
> ===
> pipeline=Pipeline([('scale', preprocessing.StandardScaler()),('filter', 
> SelectKBest(f_regression)),('svr', svm.SVR())]
> C_range = [0.1, 1, 10, 100]
> gamma_range=numpy.logspace(-2, 2, 5)
> param_grid=[{'svr__kernel': ['rbf'], 'svr__gamma': gamma_range,'svr__C': 
> C_range}]
> grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, 
> scoring=scoring_function)
> grid_search.fit(X_train, Y_train)
> ===
> 
> More details about it you should be able to find in documentation:
>   • http://scikit-learn.org/stable/modules/grid_search.html#grid-search
>   • 
> http://scikit-learn.org/stable/modules/grid_search.html#gridsearch-scoring
> 
> 2016-05-12 17:05 GMT+01:00 Amita Misra :
> Hi,
> 
> I have a limited dataset and hence want  to learn the parameters and also 
> evaluate the final model. 
> From the documents it looks that nested cross validation is the way to do it. 
> I have the code but still I want to be sure that I am not overfitting any way.
> 
> pipeline=Pipeline([('scale', preprocessing.StandardScaler()),('filter', 
> SelectKBest(f_regression)),('svr', svm.SVR())]
> C_range = [0.1, 1, 10, 100]
> gamma_range=numpy.logspace(-2, 2, 5)
> param_grid=[{'svr__kernel': ['rbf'], 'svr__gamma': gamma_range,'svr__C': 
> C_range}]
> grid_search = GridSearchCV(pipeline, param_grid=param_grid,cv=5) 
> Y_pred=cross_validation.cross_val_predict(grid_search, X_train, Y_train,cv=10)
> 
> correlation=  numpy.ma.corrcoef(Y_train,Y_pred)[0, 1]
> 
> 
> please let me know if my understanding is correct.
> 
> This is 10*5 nested cross validation. Inner folds CV over training data 
> involves a grid search over hyperparameters and outer folds evaluate the 
> performance.
> 
> 
> 
> Thanks,
> Amita-- 
> Amita Misra
> Graduate Student Researcher
> Natural Language and Dialogue Systems Lab
> Baskin School of Engineering
> University of California Santa Cruz
> 
> 
> --
> Mobile security can be enabling, not merely restricting. Employees who
> bring their own devices (BYOD) to work are irked by the imposition of MDM
> restrictions. Mobile Device Manager Plus allows you to control only the
> apps on BYO-devices by containerizing them, leaving personal data untouched!
> https://ad.doubleclick.net/ddm/clk/304595813;131938128;j
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 
> 
> 
> 
> -- 
> Yours sincerely,
> Alexey A. Dral
> --
> Mobile security can be enabling, not merely restricting. Employees who
> bring their own devices (BYOD) to work are irked by the imposition of MDM
> restrictions. Mobile Device Manager Plus allows you to control only the
> apps on BYO-devices by containerizing them, leaving personal data untouched!
> https://ad.doubleclick.net/ddm/clk/304595813;131938128;j___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general




Re: [Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Алексей Драль
Hi Amita,

As far as I understand your question, you only need one CV loop to optimize
your objective with the scoring function provided:

===
pipeline = Pipeline([('scale', preprocessing.StandardScaler()),
                     ('filter', SelectKBest(f_regression)),
                     ('svr', svm.SVR())])
C_range = [0.1, 1, 10, 100]
gamma_range = numpy.logspace(-2, 2, 5)
param_grid = [{'svr__kernel': ['rbf'], 'svr__gamma': gamma_range,
               'svr__C': C_range}]
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5,
                           scoring=scoring_function)
grid_search.fit(X_train, Y_train)
===
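
After fitting, the tuned parameters and the inner-CV estimate can be read off
the fitted object as a follow-up to the snippet above (attribute names as
documented for GridSearchCV; best_estimator_ requires the default refit=True):

print(grid_search.best_params_)     # tuned hyperparameters
print(grid_search.best_score_)      # mean cross-validated score of that setting
print(grid_search.best_estimator_)  # pipeline refit on all of X_train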

You should be able to find more details in the documentation:

   - http://scikit-learn.org/stable/modules/grid_search.html#grid-search
   - http://scikit-learn.org/stable/modules/grid_search.html#gridsearch-scoring


2016-05-12 17:05 GMT+01:00 Amita Misra :

> Hi,
>
> I have a limited dataset and hence want  to learn the parameters and also
> evaluate the final model.
> From the documents it looks that nested cross validation is the way to do
> it. I have the code but still I want to be sure that I am not overfitting
> any way.
>
> pipeline=Pipeline([('scale', preprocessing.StandardScaler()),('filter',
> SelectKBest(f_regression)),('svr', svm.SVR())]
> C_range = [0.1, 1, 10, 100]
> gamma_range=numpy.logspace(-2, 2, 5)
> param_grid=[{'svr__kernel': ['rbf'], 'svr__gamma': gamma_range,'svr__C':
> C_range}]
> grid_search = GridSearchCV(pipeline, param_grid=param_grid,cv=5)
> Y_pred=cross_validation.cross_val_predict(grid_search, X_train,
> Y_train,cv=10)
>
> correlation=  numpy.ma.corrcoef(Y_train,Y_pred)[0, 1]
>
>
> please let me know if my understanding is correct.
>
> This is 10*5 nested cross validation. Inner folds CV over training data
> involves a grid search over hyperparameters and outer folds evaluate the
> performance.
>
>
> Thanks,
> Amita--
> Amita Misra
> Graduate Student Researcher
> Natural Language and Dialogue Systems Lab
> Baskin School of Engineering
> University of California Santa Cruz
>
>
>
> --
> Mobile security can be enabling, not merely restricting. Employees who
> bring their own devices (BYOD) to work are irked by the imposition of MDM
> restrictions. Mobile Device Manager Plus allows you to control only the
> apps on BYO-devices by containerizing them, leaving personal data
> untouched!
> https://ad.doubleclick.net/ddm/clk/304595813;131938128;j
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>


-- 
Yours sincerely,
Alexey A. Dral


Re: [Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread Josh Vredevoogd
Another point of confusion:
You shouldn't be using clf.predict() to calculate ROC AUC; you need
clf.predict_proba(). ROC is a measure of ranking, and predict only gives you
the predicted class, not the probability, so the ROC "curve" can only have
points at 0 and 1 instead of any probability in between.
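
A minimal illustration of that point, with an illustrative dataset and
classifier (imports follow the 0.17-era layout used elsewhere in this thread):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_train, y_train)

# ROC AUC needs a ranking score per sample, not a hard 0/1 label.
auc_from_proba = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test))  # degenerate
print(auc_from_proba, auc_from_labels)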

On Thu, May 12, 2016 at 9:15 AM, Andreas Mueller  wrote:

> How did you evaluate on the development set?
> You should use "best_score_", not grid_search.score.
>
>
> On 05/12/2016 08:07 AM, A neuman wrote:
>
> thats actually what i did.
>
> and the difference is way to big.
>
> Should I do it withlout gridsearchCV? I'm just wondering why gridsearch
> giving me overfitted values. I know that these are the best params and so
> on... but i thought i can skip the manual part where i test the params on
> my own.  GridsearchCV give me just one pool of params, if they are
> overfitting, i cant use gridsearchCV?  Just having problems to understand
> this.
>
>
>
> On 12 May 2016 at 13:45, Olivier Grisel  wrote:
>
>> 2016-05-12 13:02 GMT+02:00 A neuman < 
>> themagenta...@gmail.com>:
>> > Thanks for the answer!
>> >
>> > but how should i check that its overfitted or not?
>>
>> Do a development / evaluation split of your dataset, for instance with
>> the train_test_split utility first. Then train your GridSearchCV model
>> on the development set and evaluate it both on the development set and
>> on the evaluation set. If the difference is large it means that you
>> are overfittng.
>>
>> --
>> Olivier
>>
>>
>> --
>> Mobile security can be enabling, not merely restricting. Employees who
>> bring their own devices (BYOD) to work are irked by the imposition of MDM
>> restrictions. Mobile Device Manager Plus allows you to control only the
>> apps on BYO-devices by containerizing them, leaving personal data
>> untouched!
>> https://ad.doubleclick.net/ddm/clk/304595813;131938128;j
>> ___
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>
>
> --
> Mobile security can be enabling, not merely restricting. Employees who
> bring their own devices (BYOD) to work are irked by the imposition of MDM
> restrictions. Mobile Device Manager Plus allows you to control only the
> apps on BYO-devices by containerizing them, leaving personal data 
> untouched!https://ad.doubleclick.net/ddm/clk/304595813;131938128;j
>
>
>
> ___
> Scikit-learn-general mailing 
> listScikit-learn-general@lists.sourceforge.nethttps://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
>
>
> --
> Mobile security can be enabling, not merely restricting. Employees who
> bring their own devices (BYOD) to work are irked by the imposition of MDM
> restrictions. Mobile Device Manager Plus allows you to control only the
> apps on BYO-devices by containerizing them, leaving personal data
> untouched!
> https://ad.doubleclick.net/ddm/clk/304595813;131938128;j
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>


[Scikit-learn-general] nested cross validation to get unbiased results

2016-05-12 Thread Amita Misra
Hi,

I have a limited dataset and hence want to learn the parameters and also
evaluate the final model.
From the documentation it looks like nested cross-validation is the way to do
it. I have the code, but I still want to be sure that I am not overfitting in
any way.

pipeline = Pipeline([('scale', preprocessing.StandardScaler()),
                     ('filter', SelectKBest(f_regression)),
                     ('svr', svm.SVR())])
C_range = [0.1, 1, 10, 100]
gamma_range = numpy.logspace(-2, 2, 5)
param_grid = [{'svr__kernel': ['rbf'], 'svr__gamma': gamma_range,
               'svr__C': C_range}]
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
Y_pred = cross_validation.cross_val_predict(grid_search, X_train,
                                            Y_train, cv=10)

correlation = numpy.ma.corrcoef(Y_train, Y_pred)[0, 1]


Please let me know if my understanding is correct.

This is 10x5 nested cross-validation: the inner CV over the training folds
performs a grid search over hyperparameters, and the outer folds evaluate the
performance.
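
A complementary way to summarize the same nested 10x5 CV, reusing the
grid_search, X_train and Y_train names from the snippet above: per-outer-fold
scores via cross_val_score (which uses the estimator's default score, R^2 for
SVR), rather than pooling all predictions into a single correlation:

fold_scores = cross_validation.cross_val_score(grid_search, X_train, Y_train,
                                               cv=10)
print(fold_scores.mean(), fold_scores.std())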


Thanks,
Amita--
Amita Misra
Graduate Student Researcher
Natural Language and Dialogue Systems Lab
Baskin School of Engineering
University of California Santa Cruz


Re: [Scikit-learn-general] DPGMM applied to 1-dimensional vector and variance problem

2016-05-12 Thread Andreas Mueller

Hi Johan.
Unfortunately there are known problems with DPGMM:
https://github.com/scikit-learn/scikit-learn/issues/2454

There is a PR to reimplement it:
https://github.com/scikit-learn/scikit-learn/pull/4802
I didn't know about dpcluster; it seems unmaintained. But maybe it is
something to compare against?


Andy

On 05/05/2016 11:16 PM, Johan Mazel wrote:

Hello
Sorry for the double post, I think there was a problem with my 
previous message.


I am trying to use the DPGMM technique 
(http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html) 
to find classes of data in 1-dimensional vectors.


The extracted variance/standard deviation seems a bit off: either
underestimated or overestimated compared to dpcluster
(https://github.com/teodor-moldovan/dpcluster).
I wrote a small script that gives two examples and attached it to this
mail.

Please tell me if I am doing anything wrong.

Thank you very much for your time.

Regards,
Johan






Re: [Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread Andreas Mueller

How did you evaluate on the development set?
You should use "best_score_", not grid_search.score.
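
A sketch of that distinction, with illustrative data and grid: best_score_ is
the cross-validated score of the winning parameter setting, whereas calling
.score() on the development set evaluates the refit model on data it was
trained on (imports follow the 0.17-era layout used in this thread):

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV

X_dev, y_dev = make_classification(n_samples=1000, random_state=0)
grid_search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=10)
grid_search.fit(X_dev, y_dev)

print(grid_search.best_score_)          # cross-validated estimate on the dev set
print(grid_search.score(X_dev, y_dev))  # refit model scored on its own training data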

On 05/12/2016 08:07 AM, A neuman wrote:

thats actually what i did.

and the difference is way to big.

Should I do it withlout gridsearchCV? I'm just wondering why 
gridsearch giving me overfitted values. I know that these are the best 
params and so on... but i thought i can skip the manual part where i 
test the params on my own.  GridsearchCV give me just one pool of 
params, if they are overfitting, i cant use gridsearchCV?  Just having 
problems to understand this.




On 12 May 2016 at 13:45, Olivier Grisel > wrote:


2016-05-12 13:02 GMT+02:00 A neuman >:
> Thanks for the answer!
>
> but how should i check that its overfitted or not?

Do a development / evaluation split of your dataset, for instance with
the train_test_split utility first. Then train your GridSearchCV model
on the development set and evaluate it both on the development set and
on the evaluation set. If the difference is large it means that you
are overfittng.

--
Olivier


--
Mobile security can be enabling, not merely restricting. Employees who
bring their own devices (BYOD) to work are irked by the imposition
of MDM
restrictions. Mobile Device Manager Plus allows you to control
only the
apps on BYO-devices by containerizing them, leaving personal data
untouched!
https://ad.doubleclick.net/ddm/clk/304595813;131938128;j
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net

https://lists.sourceforge.net/lists/listinfo/scikit-learn-general




--
Mobile security can be enabling, not merely restricting. Employees who
bring their own devices (BYOD) to work are irked by the imposition of MDM
restrictions. Mobile Device Manager Plus allows you to control only the
apps on BYO-devices by containerizing them, leaving personal data untouched!
https://ad.doubleclick.net/ddm/clk/304595813;131938128;j


___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general




Re: [Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread A neuman
That's actually what I did.

And the difference is way too big.

Should I do it without GridSearchCV? I'm just wondering why grid search is
giving me overfitted values. I know that these are the best params and so
on... but I thought I could skip the manual part where I test the params on
my own. GridSearchCV gives me just one pool of params; if they are
overfitting, can't I use GridSearchCV? Just having problems understanding
this.



On 12 May 2016 at 13:45, Olivier Grisel  wrote:

> 2016-05-12 13:02 GMT+02:00 A neuman :
> > Thanks for the answer!
> >
> > but how should i check that its overfitted or not?
>
> Do a development / evaluation split of your dataset, for instance with
> the train_test_split utility first. Then train your GridSearchCV model
> on the development set and evaluate it both on the development set and
> on the evaluation set. If the difference is large it means that you
> are overfittng.
>
> --
> Olivier
>
>
> --
> Mobile security can be enabling, not merely restricting. Employees who
> bring their own devices (BYOD) to work are irked by the imposition of MDM
> restrictions. Mobile Device Manager Plus allows you to control only the
> apps on BYO-devices by containerizing them, leaving personal data
> untouched!
> https://ad.doubleclick.net/ddm/clk/304595813;131938128;j
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>


Re: [Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread Olivier Grisel
2016-05-12 13:02 GMT+02:00 A neuman :
> Thanks for the answer!
>
> but how should i check that its overfitted or not?

Do a development / evaluation split of your dataset, for instance with
the train_test_split utility first. Then train your GridSearchCV model
on the development set and evaluate it both on the development set and
on the evaluation set. If the difference is large it means that you
are overfitting.
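
That procedure as a runnable sketch, with an illustrative dataset, classifier
and grid (imports follow the 0.17-era layout used in this thread):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split

X, y = make_classification(n_samples=1200, random_state=0)

# 1) development / evaluation split
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, test_size=0.25,
                                                random_state=0)

# 2) train the GridSearchCV model on the development set only
grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    {'max_depth': [3, 5, None],
                     'min_samples_leaf': [1, 2, 5]}, cv=10)
grid.fit(X_dev, y_dev)

# 3) evaluate on both sets; a large gap indicates overfitting
print('dev  accuracy: %.3f' % grid.score(X_dev, y_dev))
print('eval accuracy: %.3f' % grid.score(X_eval, y_eval))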

-- 
Olivier



Re: [Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread Joel Nothman
This would be much clearer if you provided some code, but I think I get
what you're saying.

The final GridSearchCV model is trained on the full training set, so the
fact that it perfectly fits that data with random forests is not altogether
surprising. What you can say about the parameters is that they are also the
best parameters (among those searched) for the RF classifier to predict the
held-out samples under cross-validation.

On 12 May 2016 at 19:53, A neuman  wrote:

> Hello everyone,
>
> I'm having a bit trouble with the parameters that I've got from
> gridsearchCV.
>
>
> For example:
>
> If i'm using the parameter what i've got from grid seardh CV for example
> on RF oder k-nn and i test the model on the train set, i get everytime an
> AUC value about 1.00 or 0.99.
> The dataset have 1200 samples.
>
> Does that mean that i can't use the Parameters that i've got from the
> gridsearchCV? Cause it was in actually every case. I already tried the
> nested-CV to compare the algorithms.
>
>
> example for RF with the values i have got from gridsearchCV (10-fold):
>
> RandomForestClassifier(n_estimators=200,oob_score=True,max_features=None,random_state=1,min_samples_leaf=
> 2,class_weight='balanced_subsample')
>
>
> and then i'm just using*clf.predict(X_train) *and test it on the*
> y_train set. *
>
> the AUC-value from the  clf.predict(X_test)  are about 0.73, so there is a
> big difference from  the train and test dataset.
>
> best,
>
>
> --
> Mobile security can be enabling, not merely restricting. Employees who
> bring their own devices (BYOD) to work are irked by the imposition of MDM
> restrictions. Mobile Device Manager Plus allows you to control only the
> apps on BYO-devices by containerizing them, leaving personal data
> untouched!
> https://ad.doubleclick.net/ddm/clk/304595813;131938128;j
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>


[Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread A neuman
Hello everyone,

I'm having a bit of trouble with the parameters that I've got from
GridSearchCV.


For example:

If I'm using the parameters that I've got from grid search CV, for example on
RF or k-NN, and I test the model on the train set, I get an AUC value of about
1.00 or 0.99 every time.
The dataset has 1200 samples.

Does that mean that I can't use the parameters that I've got from
GridSearchCV? Because this happened in practically every case. I already tried
nested CV to compare the algorithms.


Example for RF with the values I have got from GridSearchCV (10-fold):

RandomForestClassifier(n_estimators=200, oob_score=True, max_features=None,
                       random_state=1, min_samples_leaf=2,
                       class_weight='balanced_subsample')


and then I'm just using clf.predict(X_train) and testing it against the
y_train set.

The AUC value from clf.predict(X_test) is about 0.73, so there is a
big difference between the train and test datasets.

best,