Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-05 Thread Andreas Mueller



On 6/4/19 8:44 PM, C W wrote:

> Thank you all for the replies.
>
> I agree that prediction accuracy is great for evaluating black-box ML
> models. Especially advanced models like neural networks, or
> not-so-black-box models like LASSO, because they are NP-hard to solve.
>
> Linear regression is not a black box. I view prediction accuracy as
> overkill for interpretable models, especially when you can use
> R-squared, coefficient significance, etc.
>
> Prediction accuracy also does not tell you which feature is important.
>
> What do you guys think? Thank you!


Did you read the paper that I sent? ;)


[scikit-learn] Any way to tune threshold of Birch rather than GridSearchCV?

2019-06-05 Thread lampahome
I use Birch to cluster my data, which is time-series-like.

I don't know the actual number of clusters in advance and need to read
large data incrementally (online learning), so I chose Birch rather than
MiniBatchKMeans.

Reading the documentation, the critical parameters seem to be
branching_factor and threshold, and threshold clearly affects the number
of clusters I get!

Is there any way to estimate a suitable threshold for Birch? Paper
suggestions are also welcome.
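
For concreteness, here is a rough sketch of the kind of manual sweep I
have in mind, scoring each threshold with silhouette since I have no
labels; the data and the grid below are only illustrative:

import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in data; in practice this would be my time-series features.
X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

best = None
for threshold in np.linspace(0.1, 2.0, 20):  # illustrative grid
    # n_clusters=None keeps the raw subclusters, so the effect of
    # threshold on the number of clusters is visible directly.
    labels = Birch(threshold=threshold, branching_factor=50,
                   n_clusters=None).fit_predict(X)
    if len(np.unique(labels)) < 2:  # silhouette needs >= 2 clusters
        continue
    score = silhouette_score(X, labels)
    if best is None or score > best[0]:
        best = (score, threshold)

print(best)  # (best silhouette score, corresponding threshold)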

thx


Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-05 Thread Matthew Brett
On Wed, Jun 5, 2019 at 8:18 AM Brown J.B. via scikit-learn wrote:
>
> On Wed, Jun 5, 2019 at 10:43, Brown J.B. wrote:
>>
>> Contrast this to Pearson Product Moment Correlation (R), where the
>> fitted line is not required to go through the origin.
>
>
> Not sure what I was thinking when I wrote that.
> Pardon the mistake; I'm fully aware that Pearson R is merely a
> coefficient indicating the direction of a trend.

Ah - now I'm more confused.  r is surely a coefficient, but I
personally find it most useful to think of r as the least-squares
regression slope once the x and y values have been transformed to
standard scores.  For that case, the least-squares intercept must be
0.
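
A quick numerical check of this, with synthetic data (just a sketch):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

# Transform both variables to standard scores.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

slope, intercept = np.polyfit(zx, zy, 1)  # least-squares fit
r = np.corrcoef(x, y)[0, 1]

print(slope, r)    # equal up to floating-point error
print(intercept)   # ~0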

Cheers,

Matthew


Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-05 Thread Brown J.B. via scikit-learn
On Wed, Jun 5, 2019 at 10:43, Brown J.B. wrote:

> Contrast this to Pearson Product Moment Correlation (R), where the
> fitted line is not required to go through the origin.
>

Not sure what I was thinking when I wrote that.
Pardon the mistake; I'm fully aware that Pearson R is merely a coefficient
indicating the direction of a trend.


Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-05 Thread Matthieu Brucher
Hi CW,

It's not about the concept of the black box; none of the algorithms in
sklearn are black boxes. The question is about model validity. Is linear
regression a valid representation of your data? That's what the train/test
split answers. You may think so, but only this process will answer it
properly.

Matthieu

On Wed, Jun 5, 2019 at 01:46, C W wrote:

> Thank you all for the replies.
>
> I agree that prediction accuracy is great for evaluating black-box ML
> models. Especially advanced models like neural networks, or
> not-so-black-box models like LASSO, because they are NP-hard to solve.
>
> Linear regression is not a black box. I view prediction accuracy as
> overkill for interpretable models, especially when you can use
> R-squared, coefficient significance, etc.
>
> Prediction accuracy also does not tell you which feature is important.
>
> What do you guys think? Thank you!
>
> On Mon, Jun 3, 2019 at 11:43 AM Andreas Mueller wrote:
>
>> This classical paper on statistical practices (Breiman's "two cultures")
>> might be helpful to understand the different viewpoints:
>>
>> https://projecteuclid.org/euclid.ss/1009213726
>>
>>
>> On 6/3/19 12:19 AM, Brown J.B. via scikit-learn wrote:
>>
>>> As far as I understand: Holding out a test set is recommended if you
>>> aren't entirely sure that the assumptions of the model hold (Gaussian
>>> error on a linear fit; independent and identically distributed samples).
>>> The model evaluation approach in predictive ML, using held-out data, relies
>>> only on the weaker assumption that the metric you have chosen, when applied
>>> to the test set you have held out, forms a reasonable measure of
>>> generalised / real-world performance. (Of course this too often does not
>>> hold in practice, but it is the primary assumption, in my opinion, that ML
>>> practitioners need to be careful of.)
>>>
>>
>> Dear CW,
>> As Joel has said, holding out a test set will help you evaluate the
>> validity of model assumptions, and his last point (reasonable measure of
>> generalised performance) is absolutely essential for understanding the
>> capabilities and limitations of ML.
>>
>> To add to your checklist for interpreting ML papers properly, be cautious
>> when interpreting reports of high performance when using 5/10-fold or
>> Leave-One-Out cross-validation on large datasets, where "large" depends on
>> the nature of the problem setting.
>> Results are also highly dependent on the distributions of the underlying
>> independent variables (e.g., 6 datapoints all with near-identical
>> distributions may yield phenomenal performance in cross validation and be
>> almost non-predictive in truly unknown/prospective situations).
>> Even at 500 datapoints, if independent variable distributions look
>> similar (with similar endpoints), then when each model is trained on 80% of
>> that data, the remaining 20% will certainly be predictable, and repeating
>> that five times will yield statistics that seem impressive.
>>
>> So, again, while problem context completely dictates ML experiment
>> design, metric selection, and interpretation of outcome, my personal rule
>> of thumb is to do no more than 2-fold cross-validation (50% train, 50%
>> predict) when I have 100+ datapoints.
>> Even more extreme, try 33% for training and 67% for validation (or
>> even 20/80).
>> If your model still reports good statistics, then you can believe that
>> the patterns in the training data extrapolate well to the ones in the
>> external validation data.
>>
>> Hope this helps,
>> J.B.
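
To make the 50/50 evaluation described above concrete, a minimal sketch
with synthetic data (everything here is illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(size=200)

# 2-fold CV: each model trains on 50% of the data and predicts the
# other 50%, a harsher test than the usual 5- or 10-fold splits.
cv = KFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(scores)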


-- 
Quantitative researcher, Ph.D.
Blog: http://blog.audio-tk.com/
LinkedIn: http://www.linkedin.com/in/matthieubrucher