[issue45766] Add direct proportion option to statistics.linear_regression()

2021-11-21 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Thanks for looking at this and giving it some good thought.

--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45766] Add direct proportion option to statistics.linear_regression()

2021-11-21 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset d2b55b07d2b503dcd3b5c0e2753efa835cff8e8f by Raymond Hettinger in 
branch 'main':
bpo-45766: Add direct proportion option to linear_regression(). (#29490)
https://github.com/python/cpython/commit/d2b55b07d2b503dcd3b5c0e2753efa835cff8e8f





[issue45766] Add direct proportion option to statistics.linear_regression()

2021-11-21 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

Hi Raymond,

I'm satisfied that this should be approved. The code looks good to me 
and in my tests it matches the results from other software.

I don't think there is any need to verify that plain OLS regression 
produces an intercept close to zero. (What counts as close to zero?) If 
users want to check that, they can do so themselves.

Regarding my concern with the coefficient of determination, I don't 
think that's enough of a problem that it should delay adding this 
functionality. I don't know what, if anything, should be done, but in 
the meantime we should approve this new feature.

For the record, an example of the problem can be seen on the last slide 
here:

https://www.azdhs.gov/documents/preparedness/state-laboratory/lab-licensure-certification/technical-resources/calibration-training/09-linear-forced-through-zero-calib.pdf

The computed r**2 of 1.0 is clearly too high for the RTO line.




[issue45766] Add direct proportion option to statistics.linear_regression()

2021-11-11 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

It usually isn't wise to be preachy in the docs, but we could add a suggestion 
that proportional=True be used only when (0, 0) is known to be in the dataset 
and when it is in the same neighborhood as the other data points.  A reasonable 
cross-check would be to verify that a plain OLS regression would produce an 
intercept near zero.

linear_regression(hours_since_poll_started, number_of_respondents, 
proportional=True)




[issue45766] Add direct proportion option to statistics.linear_regression()

2021-11-10 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

Sure, I’m happy to wait.

My thoughts:

* The first link you provided does give the same slope across packages.  Where 
they differ is in how they choose to report statistics for assessing goodness 
of fit or for informing hypothesis testing. Neither of those applies to us.

* The compared stats packages offer this functionality because some models 
don’t benefit from a non-zero constant. 

* The second link is of low quality and reads like a hastily typed, 
stream-of-consciousness rant that roughly translates to “As a blanket statement 
applicable to all RTO, I don’t believe the underlying process is linear and I 
don’t believe that a person could have a priori knowledge of a directly 
proportional relationship.”  This is bunk: a cold caller makes sales in direct 
proportion to the number of calls they make, and zero calls means zero sales.

* The last point is a distractor.  Dealing with error analysis or input error 
models is beyond the scope of the package. Doing something I could easily do 
with my HP-12C is within scope. 

* We’re offering users something simple. If you need to fit data to a 
directly proportional model, set a flag.

* If we don’t offer the option, users have to do too much work to bridge from 
what we have to what they need:

   (covariance(x, y) + mean(x)*mean(y)) / (variance(x) + mean(x)**2)
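
For reference, a minimal sketch of the arithmetic behind that expression. The 
least-squares slope of a line through the origin is sum(x*y) / sum(x*x), and 
the moment form above reproduces it exactly when the covariance and variance 
are population (divide-by-n) values. Note that statistics.covariance() and 
statistics.variance() are sample (divide-by-n-1) statistics, so plugging them 
in directly would give a slightly different number. The function names below 
are hypothetical and exist only for this illustration.

```python
from statistics import fmean


def proportional_slope(x, y):
    """Least-squares slope of a line forced through the origin."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    sxx = sum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least one non-zero value")
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxx


def slope_from_moments(x, y):
    """Same slope rebuilt from means and population moments:
    (pcov(x, y) + mean(x)*mean(y)) / (pvar(x) + mean(x)**2)
    """
    n = len(x)
    mx, my = fmean(x), fmean(y)
    pcov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    pvar = sum((xi - mx) ** 2 for xi in x) / n
    return (pcov + mx * my) / (pvar + mx ** 2)


# Made-up data, roughly y = 3 * x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 5.8, 9.2, 11.9, 15.1]
print(proportional_slope(x, y))
print(slope_from_moments(x, y))  # agrees up to floating-point rounding
```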




[issue45766] Add direct proportion option to statistics.linear_regression()

2021-11-10 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

Hi Raymond,

I'm conflicted by this. Regression through the origin is clearly a thing which 
is often desired. In that sense, I'm happy to see it added, and thank you.

But on the other hand, this may open a can of worms that I personally don't 
feel entirely competent to deal with. Are you happy to hold off a few days 
while I consult with some statistics experts?

- There is some uncertainty as to the correct method of calculation, with many 
stats packages giving different results for the same data, e.g.

https://web.ist.utl.pt/~ist11038/compute/errtheory/,regression/regrthroughorigin.pdf

- Forcing the intercept through the origin is a dubious thing to do, even if 
you think it is theoretically justified, see for example the above paper, also:

https://dynamicecology.wordpress.com/2017/04/13/dont-force-your-regression-through-zero-just-because-you-know-the-true-intercept-has-to-be-zero/

https://www.theanalysisfactor.com/regression-through-the-origin/

- Regression through the origin needs a revised calculation for the coefficient 
of determination (Pearson's R squared):

https://pubs.cif-ifc.org/doi/pdf/10.5558/tfc71326-3

https://www.researchgate.net/publication/28191_Re-interpreting_R-squared_regression_through_the_origin_and_weighted_least_squares

but it's not clear how to revise the calculation, with some methods giving R 
squared negative or greater than 1.

- Regression through the origin is only one of a number of variants of 
least-squares linear regression that we might also wish to offer, e.g. 
intercept-only, Deming or orthogonal regression.

https://en.wikipedia.org/wiki/Deming_regression
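
To make the coefficient-of-determination concern above concrete, here is a 
small sketch that computes R squared two ways for a fit through the origin: 
against the total sum of squares about the mean of y (the usual OLS 
definition), and against the raw sum of squares of y (one commonly used 
alternative for regression through the origin). The function name and data are 
made up for illustration, and neither definition is claimed to be the correct 
one.

```python
from statistics import fmean


def rto_r_squared(x, y):
    """Fit y = slope * x by least squares and return
    (slope, r2_centered, r2_uncentered).
    """
    slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    # Residual sum of squares for the through-origin line.
    ss_res = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))
    my = fmean(y)
    ss_tot_centered = sum((yi - my) ** 2 for yi in y)  # about the mean of y
    ss_tot_raw = sum(yi ** 2 for yi in y)              # about zero
    return slope, 1 - ss_res / ss_tot_centered, 1 - ss_res / ss_tot_raw


# Made-up data with a large true intercept, where a line through the
# origin fits badly.
slope, r2_centered, r2_uncentered = rto_r_squared([1, 2, 3], [10, 11, 12])
print(slope, r2_centered, r2_uncentered)
```

On this data the centered version comes out negative while the uncentered 
version stays between 0 and 1, which is the kind of disagreement between 
methods described above.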




[issue45766] Add direct proportion option to statistics.linear_regression()

2021-11-09 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
keywords: +patch
pull_requests: +27741
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/29490




[issue45766] Add direct proportion option to statistics.linear_regression()

2021-11-09 Thread Raymond Hettinger


New submission from Raymond Hettinger :

Signature:

def linear_regression(x, y, /, *, proportional=False):

Additional docstring with example:

If *proportional* is true, the independent variable *x* and the
dependent variable *y* are assumed to be directly proportional.
The data is fit to a line passing through the origin.

Since the *intercept* will always be 0.0, the underlying linear
function simplifies to:

y = slope * x + noise

>>> y = [3 * x[i] + noise[i] for i in range(5)]
>>> linear_regression(x, y, proportional=True)  #doctest: +ELLIPSIS
LinearRegression(slope=3.0244754248461283, intercept=0.0)

See Wikipedia entry for regression without an intercept term:
https://en.wikipedia.org/wiki/Simple_linear_regression#Simple_linear_regression_without_the_intercept_term_(single_regressor)

Compare with the *const* parameter in MS Excel's linest() function:
https://support.microsoft.com/en-us/office/linest-function-84d7d0d9-6e50-4101-977a-fa7abf772b6d

Compare with the *IncludeConstantBasis* option in Mathematica:
https://reference.wolfram.com/language/ref/IncludeConstantBasis.html

--
components: Library (Lib)
messages: 406026
nosy: rhettinger, steven.daprano
priority: normal
severity: normal
status: open
title: Add direct proportion option to statistics.linear_regression()
versions: Python 3.11
