[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2021-08-30 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2021-08-30 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset 793f55bde9b0299100c12ddb0e6949c6eb4d85e5 by Raymond Hettinger in 
branch 'main':
bpo-39218: Improve accuracy of variance calculation (GH-27960)
https://github.com/python/cpython/commit/793f55bde9b0299100c12ddb0e6949c6eb4d85e5


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2021-08-26 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> what it's correcting for is an inaccurate value of "c" [...]

I'll leave the logic as-is and just add a note about what is being corrected.

> Numerically, it's probably not helpful.

To make a difference, the mean would have to have huge magnitude relative to 
the variance; otherwise, squaring the error would drown it out to zero.

The mean() should already be accurate to within a 1/2 ulp.  The summation and 
division are exact.  There is only a single rounding when the result converts 
from Fraction to a float or decimal.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2021-08-26 Thread Mark Dickinson


Mark Dickinson  added the comment:

> what it's correcting for is an inaccurate value of "c" [...]

In more detail:

Suppose "m" is the true mean of the x in data, but all we have is an 
approximate mean "c" to work with. Write "e" for the error in that 
approximation, so that c = m + e. Then (using Python notation, but treating the 
expressions as exact mathematical expressions computed in the reals):

   sum((x-c)**2 for x in data)

== sum((x-m-e)**2 for x in data)

== sum((x - m)**2 for x in data) - 2 * sum((x - m)*e for x in data)
 + sum(e**2 for x in data)

== sum((x - m)**2 for x in data) - 2 * e * sum((x - m) for x in data)
 + sum(e**2 for x in data)

== sum((x - m)**2 for x in data) + sum(e**2 for x in data)
   (because sum((x - m) for x in data) is 0)

== sum((x - m)**2 for x in data) + n*e**2

So the error in our result arising from the error in computing m is that n*e**2 
term. And that's the term that's being subtracted here, because

   sum(x - c for x in data) ** 2 / n
== sum(x - m - e for x in data) ** 2 / n
== (sum(x - m for x in data) - sum(e for x in data))**2 / n
== (0 - n * e)**2 / n
== n * e**2

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2021-08-26 Thread Mark Dickinson


Mark Dickinson  added the comment:

> The rounding correction in _ss() looks mathematically incorrect to me [...]

I don't think it was intended as a rounding correction - I think it's just 
computing the variance (prior to the division by n or n-1) of the `(x - c)` 
terms using the standard "expectation of x^2 - (expectation of x)^2" formula:

  sum((x - c)**2 for x in data) - (sum(x - c for x in data)**2) / n

So I guess it *can* be thought of as a rounding correction, but what it's 
correcting for is an inaccurate value of "c"; it's not correcting for 
inaccuracies in the subtraction results. That is, if you were to add an 
artificial error into c at some point before computing "total" and "total2", 
that correction term should take you back to something approaching the true sum 
of squares of deviations.

So mathematically, I think it's correct, but not useful, because mathematically 
"total2" will be zero. Numerically, it's probably not helpful.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2021-08-25 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
keywords: +patch
pull_requests: +26406
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/27960

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2021-08-25 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

The rounding correction in _ss() looks mathematically incorrect to me:

   ∑ (xᵢ - x̅ + εᵢ)² = ∑ (xᵢ - x̅)² - (∑ εᵢ)² ÷ n

If we drop this logic (which seems completely bogus), all the tests still pass 
and the code becomes cleaner:

def _ss(data, c=None):
if c is None:
c = mean(data)
T, total, count = _sum((y := x - c) * y for x in data)
return (T, total)


-- Algebraic form of the current code --

from sympy import symbols, simplify

x1, x2, x3, e1, e2, e3 = symbols('x1 x2 x3 e1 e2 e3')
n = 3

# high accuracy mean
c = (x1 + x2 + x3) / n

# sum of squared deviations with subtraction errors
total = (x1 - c + e1)**2 + (x2 - c + e2)**2 + (x3 - c + e3)**2

# sum of subtraction errors = e1 + e2 + e3
total2 = (x1 - c + e1) + (x2 - c + e2) + (x3 - c + e3)

# corrected sum of squared deviations
total -= total2 ** 2 / n

# exact sum of squared deviations
desired = (x1 - c)**2 + (x2 - c)**2 + (x3 - c)**2

# expected versus actual
print(simplify(desired - total))

This gives:

(e1 + e2 + e3)**2/3
+ (-2*x1 + x2 + x3)**2/9
+ (x1 - 2*x2 + x3)**2/9
+ (x1 + x2 - 2*x3)**2/9
- (3*e1 + 2*x1 - x2 - x3)**2/9
- (3*e2 - x1 + 2*x2 - x3)**2/9
- (3*e3 - x1 - x2 + 2*x3)**2/9

-- Substituting in concrete values --

x1, x2, x3, e1, e2, e3 = 11, 17, 5, 0.3, 0.1, -0.2

This gives:

75.740001  uncorrected total
75.726667  "corrected" total
72.0   desired result

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2021-08-25 Thread Tal Einat


Change by Tal Einat :


--
nosy:  -taleinat

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2021-08-20 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Removing the assertion and implementing Steven's idea seems like the best way 
to go:

sum((y:=(x-c)) * y for x in data)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2021-08-20 Thread Irit Katriel


Irit Katriel  added the comment:

I've reproduced this on 3.9 and 3.10. This part of the code in main is still 
the same, so the issue is probably there even though we don't have numpy with 
which to test.

--
nosy: +iritkatriel
versions: +Python 3.10, Python 3.11, Python 3.9 -Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2020-01-05 Thread Reed


Reed  added the comment:

Thank you all for the comments! Either using (x-c)*(x-c), or removing the 
assertion and changing the final line to `return (U, total)`, seem reasonable. 
I slightly prefer the latter case, due to Mark's comments about x*x being 
faster and simpler than x**2. But I am not an expert on this.

> I am inclined to have the stdev of float32 return a float32 is possible. What 
> do you think?

Agreed.

> OTOH, (x-c)*(x-c) repeats the subtraction unnecessarily, but perhaps 
> assignment expressions could rescue us?

Yeah, we should avoid repeating the subtraction. Another method of doing so is 
to define a square function. For example:

def square(y):
return y*y
sum(square(x-c) for x in data)

> Would that also imply intermediate calculations being performed only with 
> float32, or would intermediate calculations be performed with a more precise 
> type?

Currently, statistics.py computes sums in infinite precision 
(https://github.com/python/cpython/blob/422ed16fb846eec0b5b2a4eb3a978c9862615665/Lib/statistics.py#L123)
 for any type. The multiplications (and exponents if we go that route) would 
still be float32.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2020-01-05 Thread Mark Dickinson


Mark Dickinson  added the comment:

[Karthikeyan]

> can possibly break again if (x-c) * (x-c) was also changed to return float64 
> in future

I think it's safe to assume that multiplying two NumPy float32's will continue 
to give a float32 back in the future; NumPy has no reason to give back a 
different type, and changing this would be a big breaking change.

The big difference between (x-c)**2 and (x-c)*(x-c) in this respect is that the 
latter is purely an operation on float32 operands, while the former is a 
mixed-type operation: NumPy sees a binary operation between a float32 and a 
Python int, and has to figure out a suitable type for the result. The int to 
float32 conversion is regarded as introducing unacceptable precision loss, so 
it chooses float64 as an acceptable output type, converts both input operands 
to that type, and then does the computation.

Some relevant parts of the NumPy docs:

https://docs.scipy.org/doc/numpy/reference/ufuncs.html#casting-rules
https://docs.scipy.org/doc/numpy/reference/generated/numpy.result_type.html

Even for pure Python floats, x*x is a simpler, more accurate, and likely faster 
operation than x**2. On a typical system, the former (eventually) maps to a 
hardware floating-point multiplication, which is highly likely to be correctly 
rounded, while the latter, after conversion of the r.h.s. to a float, maps to a 
libm pow call. That libm pow call *could* conceivably have a fast/accurate path 
for a right-hand-side of 2.0, but it could equally conceivably not have such a 
path.

OTOH, (x-c)*(x-c) repeats the subtraction unnecessarily, but perhaps assignment 
expressions could rescue us? For example, replace:

sum((x-c)**2 for x in data)

with:

sum((y:=(x-c))*y for x in data)

[Steven]

> I am inclined to have the stdev of float32 return a float32 is possible. 

Would that also imply intermediate calculations being performed only with 
float32, or would intermediate calculations be performed with a more precise 
type? float32 has small enough precision to run into real accuracy problems 
with modestly large datasets. For example:

>>> import numpy as np
>>> sum(np.ones(10**8, dtype=np.float32), start=np.float32(0))
16777216.0  # should be 10**8

or, less dramatically:

>>> sum(np.full(10**6, 1.73, dtype=np.float32), start=np.float32(0)) / 10**6
1.74242125  # only two accurate significant figures

--
nosy: +mark.dickinson

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2020-01-05 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

Nice analysis and bug report, thank you! That's pretty strange behaviour for 
float32, but I guess we're stuck with it.

I wonder if the type assertion has outlived its usefulness? I.e. drop the `T == 
U` part and change the assertion to `assert count == count2` only.

If we removed the failing part of the assertion, and changed the final line to 
`return (U, total)`, that ought to keep the exact sum but convert to float32 
later, rather than float64.

I am inclined to have the stdev of float32 return a float32 is possible. What 
do you think?

We should check the numpy docs to see what the conversion rules for numpy 
floats are.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2020-01-04 Thread Karthikeyan Singaravelan


Karthikeyan Singaravelan  added the comment:

I think it's more of an implementation artifact of numpy eq definition for 
float32 and float64 and can possibly break again if (x-c) * (x-c) was also 
changed to return float64 in future.

--
nosy: +rhettinger, steven.daprano, taleinat, xtreak

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array

2020-01-04 Thread Reed


New submission from Reed :

If a float32 Numpy array is passed to statistics.variance(), an assertion 
failure occurs. For example:

import statistics
import numpy as np
x = np.array([1, 2], dtype=np.float32)
statistics.variance(x)

The assertion error is:

assert T == U and count == count2

Even if you convert x to a list with `x = list(x)`, the issue still occurs. The 
issue is caused by the following lines in statistics.py 
(https://github.com/python/cpython/blob/ec007cb43faf5f33d06efbc28152c7fdcb2edb9c/Lib/statistics.py#L687-L691):

T, total, count = _sum((x-c)**2 for x in data)
# The following sum should mathematically equal zero, but due to rounding
# error may not.
U, total2, count2 = _sum((x-c) for x in data)
assert T == U and count == count2

When a float32 Numpy value is squared in the term (x-c)**2, it turns into a 
float64 value, causing the `T == U` assertion to fail. I think the best way to 
fix this would be to replace (x-c)**2 with (x-c)*(x-c). This fix would no 
longer assume the input's ** operator returns the same type.

--
components: Library (Lib)
messages: 359323
nosy: reed
priority: normal
severity: normal
status: open
title: Assertion failure when calling statistics.variance() on a float32 Numpy 
array
type: behavior
versions: Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com