> > As far as I understand: Holding out a test set is recommended if you
> > aren't entirely sure that the assumptions of the model hold (Gaussian
> > error on a linear fit; independent and identically distributed samples).
> > The model evaluation approach in predictive ML, using held-out data, relies
> > only on the weaker assumption that the metric you have chosen, when applied
> > to the test set you have held out, forms a reasonable measure of
> > generalised / real-world performance. (Of course this too often does not hold
> > in practice, but it is the primary assumption, in my opinion, that ML
> > practitioners need to be careful of.)
Dear CW,

As Joel has said, holding out a test set will help you evaluate the validity of model assumptions, and his last point (a reasonable measure of generalised performance) is absolutely essential for understanding the capabilities and limitations of ML.

To add to your checklist for interpreting ML papers properly: be cautious when interpreting reports of high performance obtained with 5/10-fold or leave-one-out cross-validation on large datasets, where "large" depends on the nature of the problem setting. Results are also highly dependent on the distributions of the underlying independent variables (e.g., 60,000 datapoints all with near-identical distributions may yield phenomenal performance in cross-validation and yet be almost non-predictive in truly unknown/prospective situations). Even at 500 datapoints, if the independent variable distributions look similar (with similar endpoints), then when each model is trained on 80% of that data, the remaining 20% will almost certainly be predictable, and repeating that five times will yield statistics that seem impressive.

So, again, while problem context completely dictates ML experiment design, metric selection, and interpretation of the outcome, my personal rule of thumb is to do no more than 2-fold cross-validation (50% train, 50% predict) when I have 100+ datapoints. Even more extreme, try using 33% for training and 67% for validation (or even 20/80). If your model still reports good statistics, then you can believe that the patterns in the training data extrapolate well to those in the external validation data.

Hope this helps,
J.B.
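(For concreteness, here is a minimal sketch of the comparison J.B. describes, on made-up synthetic data: conventional 5-fold cross-validation, where each model trains on 80% of the data, next to the stricter 33/67 train/validation protocol. The data, model choice, and seed are illustrative assumptions, not anyone's actual experiment.)

```python
# Sketch: 5-fold CV vs. a strict 33% train / 67% validation split.
# All data here is synthetic; Ridge is just a stand-in estimator.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)

model = Ridge()

# Conventional 5-fold CV: each fold trains on 80% of the data.
cv_scores = cross_val_score(model, X, y, cv=5)

# Stricter protocol: train on only 33%, validate on the remaining 67%.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, train_size=0.33, random_state=0)
strict_score = model.fit(X_tr, y_tr).score(X_val, y_val)

print(f"5-fold CV mean R^2: {cv_scores.mean():.3f}")
print(f"33/67 split R^2:    {strict_score:.3f}")
```

On i.i.d. data like this the two numbers will be close; the gap between them only opens up when the training and validation distributions genuinely differ, which is exactly the prospective situation the stricter split tries to simulate.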
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn