Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-04 Thread Brown J.B. via scikit-learn
Dear CW,


> Linear regression is not a black box. I view prediction accuracy as
> overkill for interpretable models, especially when you can use R-squared,
> coefficient significance, etc.
>

Following on from my previous note about being cautious with cross-validated
evaluation for classification, the same caution applies to regression.
About 20 years ago, chemoinformatics researchers pointed out the care
needed when using cross-validated R^2 (q^2) as a measure of performance.
"Beware of q2!"  Golbraikh and Tropsha, J Mol Graph Modeling (2002) 20:269
https://www.sciencedirect.com/science/article/pii/S1093326301001231

In this article, they propose measuring correlation with both the
known-vs-predicted _and_ predicted-vs-known calculations of the correlation
coefficient, with the important constraint that the regression line fitted
in each case is forced through the origin.
The resulting coefficients are checked as a pair, and the authors argue
that only if both are high can one say that the model fits the
data well.

Contrast this with the Pearson product-moment correlation (R), where the
fitted line has no requirement to pass through the origin.

I found the paper above to be helpful in filtering for more robust
regression models, and have implemented my own version of their method,
which I use as my first evaluation metric when performing regression
modelling.
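
To sketch the idea (my own simplified version, not the exact formulas or
thresholds from the paper), the pair of through-origin coefficients can be
computed roughly as follows:

import numpy as np

def origin_r2(y_obs, y_pred):
    """R^2 of a regression line forced through the origin (y_obs ~ k * y_pred)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    k = np.sum(y_obs * y_pred) / np.sum(y_pred ** 2)   # through-origin slope
    ss_res = np.sum((y_obs - k * y_pred) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return 1.0 - ss_res / ss_tot

def origin_r2_pair(y_obs, y_pred):
    """Observed-vs-predicted and predicted-vs-observed coefficients, checked as a pair."""
    return origin_r2(y_obs, y_pred), origin_r2(y_pred, y_obs)

Only when both values are high (the paper also compares them against the
ordinary R^2) would I keep the model; the exact thresholds are a judgement
call for your own data.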

Hope this gives you some food for thought.

Prediction accuracy also does not tell you which feature is important.
>

The contributions of the scikit-learn community have yielded a great set of
tools for performing feature weighting separate from model performance
evaluation.
All you need to do is read the documentation and try out some of the
examples, and you should be ready to adapt to your situation.
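
As one possible starting point (just a sketch using the bundled diabetes
data; nothing here is specific to your problem), coefficients on
standardized features already give a crude ranking, and the documentation
covers more sophisticated options:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Coefficients on standardized features give a rough view of relative feature influence
coefs = pipe.named_steps["linearregression"].coef_
for idx in abs(coefs).argsort()[::-1]:
    print("feature %d: coefficient %+.3f" % (idx, coefs[idx]))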

J.B.
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-04 Thread C W
Thank you all for the replies.

I agree that prediction accuracy is great for evaluating black-box ML
models, especially advanced models like neural networks, or not-so-black-box
models like the LASSO, which have no closed-form solution.

Linear regression is not a black box. I view prediction accuracy as
overkill for interpretable models, especially when you can use R-squared,
coefficient significance, etc.

Prediction accuracy also does not tell you which feature is important.

What do you guys think? Thank you!


On Mon, Jun 3, 2019 at 11:43 AM Andreas Mueller  wrote:

> This classical paper on statistical practices (Breiman's "two cultures")
> might be helpful to understand the different viewpoints:
>
> https://projecteuclid.org/euclid.ss/1009213726
>
>
> On 6/3/19 12:19 AM, Brown J.B. via scikit-learn wrote:
>
>> As far as I understand: Holding out a test set is recommended if you
>> aren't entirely sure that the assumptions of the model hold (Gaussian
>> error on a linear fit; independent and identically distributed samples).
>> The model evaluation approach in predictive ML, using held-out data, relies
>> only on the weaker assumption that the metric you have chosen, when applied
>> to the test set you have held out, forms a reasonable measure of
>> generalised / real-world performance. (Of course this too often does not
>> hold in practice, but it is the primary assumption, in my opinion, that ML
>> practitioners need to be careful of.)
>>
>
> Dear CW,
> As Joel has said, holding out a test set will help you evaluate the
> validity of model assumptions, and his last point (reasonable measure of
> generalised performance) is absolutely essential for understanding the
> capabilities and limitations of ML.
>
> To add to your checklist for interpreting ML papers properly, be cautious
> about reports of high performance obtained with 5/10-fold or
> Leave-One-Out cross-validation on large datasets, where "large" depends on
> the nature of the problem setting.
> Results are also highly dependent on the distributions of the underlying
> independent variables (e.g., 6 datapoints all with near-identical
> distributions may yield phenomenal performance in cross validation and be
> almost non-predictive in truly unknown/prospective situations).
> Even at 500 datapoints, if independent variable distributions look similar
> (with similar endpoints), then when each model is trained on 80% of that
> data, the remaining 20% will certainly be predictable, and repeating that
> five times will yield statistics that seem impressive.
>
> So, again, while problem context completely dictates ML experiment design,
> metric selection, and interpretation of outcome, my personal rule of thumb
> is to do no more than 2-fold cross-validation (50% train, 50% predict) when
> I have 100+ datapoints.
> Even more extreme, try 33% for training and 67% for validation (or
> even 20/80).
> If your model still reports good statistics, then you can believe that the
> patterns in the training data extrapolate well to the ones in the external
> validation data.
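>
> As a rough sketch of such a harsh split (using the bundled diabetes data
> only as a stand-in; the 33/67 ratio is just one choice):
>
> from sklearn.datasets import load_diabetes
> from sklearn.linear_model import LinearRegression
> from sklearn.metrics import r2_score
> from sklearn.model_selection import train_test_split
>
> # Train on only 33% of the data; treat the remaining 67% as external validation
> X, y = load_diabetes(return_X_y=True)
> X_train, X_valid, y_train, y_valid = train_test_split(
>     X, y, train_size=0.33, random_state=0)
> model = LinearRegression().fit(X_train, y_train)
> print("R^2 on the 67% held-out validation set:",
>       r2_score(y_valid, model.predict(X_valid)))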
>
> Hope this helps,
> J.B.
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn