I would like to make a related suggestion, but instead of focusing on the upper
bound for the number of trees, I would focus on choosing the lower bound. From a
theoretical perspective, it doesn't make sense to me how fewer trees can result
in a better performing random forest model in terms of generalization
performance. If you observe better performance on the same independent test
set with fewer trees, I would say that this is likely not a good indicator of
better generalization performance. It could be due to overfitting, chance
effects of the train/test split, and/or picking up artifacts in the dataset.
As a general suggestion, I would choose a reasonable number of trees
that seems computationally feasible given the size of the dataset and the
number of hyperparameters to compare via model selection. Then, after tuning, I
would use the best hyperparameter setting with 10x more trees and see if you
notice any significant difference in the cross-validation performance. Next, I
would fit the model to the whole training set with those best
hyperparameters and evaluate the performance on the independent test set.
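
In case it helps, here is a rough sketch of that workflow in scikit-learn. The
synthetic dataset, the base forest size of 30 trees, the small parameter grid,
and the helper name cv_accuracy are all placeholders I made up for
illustration. I also assume here that the final model uses the larger forest,
which you may or may not want depending on what the comparison shows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in data (assumption); replace with your own dataset.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# A forest size that is affordable during tuning (arbitrary choice).
base_trees = 30

# Step 1: tune the other hyperparameters with the number of trees held fixed.
param_grid = {"max_features": ["sqrt", 0.5], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=base_trees, random_state=0),
    param_grid, cv=5)
search.fit(X_train, y_train)
best_params = search.best_params_


def cv_accuracy(n_trees):
    """Return 5-fold CV accuracies for the best settings with n_trees trees."""
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0,
                                   **best_params)
    return cross_val_score(model, X_train, y_train, cv=5)


# Step 2: same settings with 10x more trees; compare cross-validation scores.
scores_base = cv_accuracy(base_trees)
scores_large = cv_accuracy(10 * base_trees)
print("CV accuracy, %d trees: %.3f +/- %.3f"
      % (base_trees, scores_base.mean(), scores_base.std()))
print("CV accuracy, %d trees: %.3f +/- %.3f"
      % (10 * base_trees, scores_large.mean(), scores_large.std()))

# Step 3: refit on the whole training set, then evaluate once on the test set.
final_model = RandomForestClassifier(n_estimators=10 * base_trees,
                                     random_state=0, **best_params)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print("Test accuracy: %.3f" % test_accuracy)
```

If the two cross-validation results are within each other's spread, the
smaller forest is probably large enough for tuning purposes; the test set is
only touched once, at the very end.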
Best,
Sebastian
> On Dec 24, 2018, at 9:27 PM, Brown J.B. via scikit-learn
> wrote:
>
> Take random forest as example, if I give estimator from 10 to 1(10, 100,
> 1000, 1) into grid search.
> Based on the result, I found estimator=100 is the best, but I don't know
> lower or greater than 100 is better.
> How should I decide? brute force or any tools better than GridSearchCV?
>
> A simple but nonetheless practical solution is to
> (1) start with an upper bound on the number of trees you are willing to
> accept in the model,
> (2) obtain its performance (ACC, MCC, F1, etc) as the starting reference
> point,
> (3) systematically lower the number of trees (log2 scale down, fixed size
> decrement, etc)
> (4) obtain the reduced forest size performance,
> (5) Repeat (3)-(4) until [performance(reference) - performance(current
> forest size)] > tolerance
>
> You can encapsulate that in a function which then returns the final model you
> obtain.
> From the model object, the number of trees can be obtained.
>
> J.B.
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn