Another point of confusion:
You shouldn't be using clf.predict() to calculate ROC AUC, you need
clf.predict_proba(). ROC is a measure of ranking, and predict() only gives you
the predicted class, not the probability, so the ROC "curve" can only have
points at 0 and 1 instead of any probability in between.
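For illustration (a minimal sketch with a toy classifier, not the poster's
code), compare the two ways of feeding roc_auc_score:

===
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# hard class labels: the "curve" collapses to a single operating point
print(roc_auc_score(y_test, clf.predict(X_test)))

# probability of the positive class: the ranking-based ROC AUC you actually want
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
===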
Hi Johan.
Unfortunately there are known problems with DPGMM
https://github.com/scikit-learn/scikit-learn/issues/2454
There is a PR to reimplement:
https://github.com/scikit-learn/scikit-learn/pull/4802
I didn't know about dpcluster; it seems unmaintained, but maybe it is
something to compare against.
How did you evaluate on the development set?
You should use "best_score_", not grid_search.score.
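For instance (a hedged sketch, not your code): best_score_ is the mean
cross-validated score of the best parameter setting, while calling score() on
the fitted GridSearchCV evaluates the refit estimator on whatever data you
pass in, e.g. the training data itself.

===
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {'max_depth': [2, 4, None]}, cv=5, scoring='roc_auc')
grid.fit(X, y)

print(grid.best_score_)   # mean cross-validated AUC of the best parameter setting
print(grid.score(X, y))   # refit forest scored on its own training data -- optimistic
===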
On 05/12/2016 08:07 AM, A neuman wrote:
That's actually what I did,
and the difference is way too big.
Should I do it without GridSearchCV? I'm just wondering why
grid search is giving me overfitted values
Hi,
I have a limited dataset and hence want to learn the parameters and also
evaluate the final model.
From the documentation it looks like nested cross-validation is the way to do
it. I have the code, but I still want to be sure that I am not overfitting in
any way.
pipeline=Pipeline([('scale',
Actually, I do not have an independent test set, and hence I want to use
nested CV as an estimate of the generalization performance. My classifier is
fixed (an SVM), and I want to learn the parameters and also estimate an
unbiased performance using only one set of data.
I wanted to ensure that my code
Hi Amita,
As far as I understand your question, you only need one CV loop to optimize
your objective, with the scoring function provided:
===
pipeline = Pipeline([('scale', preprocessing.StandardScaler()),
                     ('filter', SelectKBest(f_regression)),
                     ('svr', svm.SVR())])
C_range = [0.1, 1, 10, 100]
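# (The rest of the snippet was cut off in the digest. A hedged guess at how it
# might continue -- the k values and the use of C_range below are assumptions,
# not part of the original message.)
from sklearn.model_selection import GridSearchCV

param_grid = {'filter__k': [5, 10, 20], 'svr__C': C_range}
grid = GridSearchCV(pipeline, param_grid, scoring='neg_mean_squared_error', cv=5)
grid.fit(X, y)   # X, y: your one dataset (features and targets)
print(grid.best_params_, grid.best_score_)
===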
I would say there are 2 different applications of nested CV. You could use it
for algorithm selection (with hyperparam tuning in the inner loop). Or, you
could use it as an estimate of the generalization performance (only hyperparam
tuning), which has been reported to be less biased than the a
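A rough sketch of the second use (assumed estimator and parameter grid, purely
for illustration): the inner GridSearchCV does the hyperparameter tuning, and
the outer cross_val_score gives the generalization estimate.

===
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

inner = GridSearchCV(SVC(), {'C': [0.1, 1, 10, 100]}, cv=5)  # inner loop: tuning
scores = cross_val_score(inner, X, y, cv=5)                  # outer loop: generalization estimate
print(scores.mean(), scores.std())
===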
I am not that much into the multi-processing implementation in scikit-learn /
joblib, but I think this could be one issue why your mac hangs… I’d say that
it’s probably the safest approach to only set the n_jobs parameter for the
innermost object.
E.g., if you have 4 processors, you said the
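In a nested setup that would look roughly like this (a sketch with placeholder
estimator and data, not the thread's actual code):

===
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(random_state=0)

# n_jobs only on the innermost object; the outer cross_val_score stays serial
inner = GridSearchCV(SVC(), {'C': [0.1, 1, 10, 100]}, cv=5, n_jobs=4)
scores = cross_val_score(inner, X, y, cv=5)
===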
Oh yes, I get that now. All this while I was thinking there was an issue
with the mac due to a similar issue discussed here
https://github.com/scikit-learn/scikit-learn/issues/5115.
Thanks a lot for clearing this up. I am going to change the loop and see
if I can run the parallel implementation
I had not thought about the n_jobs parameter, mainly because it does not
run on my Mac and the system just hangs if I use it.
The same code runs on linux server though.
I have one more clarification to seek.
I was running it on the server with this code. Would this be fine, or should I
move the n_jobs=3
You are welcome, and I am glad to hear that it works :). And "your" approach is
definitely the cleaner way to do it … I think you just need to be a bit careful
about the n_jobs parameter in practice; I would only set it to n_jobs=-1 in the
inner loop.
Best,
Sebastian
> On May 12, 2016, at
Hello everyone,
I'm having a bit of trouble with the parameters that I've got from
GridSearchCV.
For example:
If I'm using the parameters that I've got from GridSearchCV, for example with
RF or k-NN, and I test the model on the train set, I get every time an AUC
value of about 1.00 or 0.99.
The
I see; that’s what I thought. At first glance, the approach (code) looks
correct to me, but I haven’t done it this way yet. Typically, I use a more
“manual” approach iterating over the outer folds manually (since I typically
use nested CV for algo selection):
gs_est = … your gridsearch,
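Filling in that idea (a sketch under assumptions: SVC and a small C grid as
placeholders, with gs_est standing for whatever grid search you configured):

===
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
gs_est = GridSearchCV(SVC(), {'C': [0.1, 1, 10, 100]}, cv=5)   # ... your gridsearch

outer = StratifiedKFold(n_splits=5)
for train_idx, test_idx in outer.split(X, y):
    gs_est.fit(X[train_idx], y[train_idx])              # inner loop tunes the hyperparameters
    print(gs_est.best_params_,
          gs_est.score(X[test_idx], y[test_idx]))       # score on the held-out outer fold
===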
Thanks.
Actually, there were 2 people running the same experiments, and the other
person was doing it as you have shown above.
We were getting the same results, but since the methods were different I
wanted to ensure that I was doing it the right way.
Thanks,
Amita
On Thu, May 12, 2016 at 2:43 PM,
This would be much clearer if you provided some code, but I think I get
what you're saying.
The final GridSearchCV model is trained on the full training set, so the
fact that it perfectly fits that data with random forests is not altogether
surprising. What you can say about the parameters is
2016-05-12 13:02 GMT+02:00 A neuman :
> Thanks for the answer!
>
> but how should I check whether it's overfitted or not?
Do a development / evaluation split of your dataset, for instance with
the train_test_split utility first. Then train your GridSearchCV model
on the
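Something along these lines (a sketch; the forest and parameter grid below are
placeholders, not your actual model):

===
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, stratify=y, random_state=0)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {'max_depth': [2, 4, None]}, cv=5, scoring='roc_auc')
grid.fit(X_dev, y_dev)                     # tune on the development part only

# evaluate on data the search never saw -- this is the number to report
print(roc_auc_score(y_eval, grid.predict_proba(X_eval)[:, 1]))
===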
That's actually what I did,
and the difference is way too big.
Should I do it without GridSearchCV? I'm just wondering why grid search is
giving me overfitted values. I know that these are the best params and so
on... but I thought I could skip the manual part where I test the params on
my own.