subject:"\[issue45876\] Improve accuracy of stdev functions in statistics"

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-30 Thread Raymond Hettinger



Raymond Hettinger  added the comment:


New changeset 0aa0bd056349f73de9577ccc38560c1d01864d51 by Raymond Hettinger in 
branch 'main':
bpo-45876:  Have stdev() also use decimal specific square root. (GH-29869)
https://github.com/python/cpython/commit/0aa0bd056349f73de9577ccc38560c1d01864d51


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-30 Thread Raymond Hettinger



Change by Raymond Hettinger :


--
pull_requests: +28095
pull_request: https://github.com/python/cpython/pull/29869

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-30 Thread Raymond Hettinger



Raymond Hettinger  added the comment:


New changeset a39f46afdead515e7ac3722464b5ee8d7b0b2c9b by Raymond Hettinger in 
branch 'main':
bpo-45876:  Correctly rounded stdev() and pstdev() for the Decimal case 
(GH-29828)
https://github.com/python/cpython/commit/a39f46afdead515e7ac3722464b5ee8d7b0b2c9b


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-28 Thread Raymond Hettinger



Change by Raymond Hettinger :


--
pull_requests: +28059
pull_request: https://github.com/python/cpython/pull/29828

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Raymond Hettinger



Raymond Hettinger  added the comment:


New changeset af9ee57b96cb872df6574e36027cc753417605f9 by Raymond Hettinger in 
branch 'main':
bpo-45876: Improve accuracy for stdev() and pstdev() in statistics (GH-29736)
https://github.com/python/cpython/commit/af9ee57b96cb872df6574e36027cc753417605f9


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

Thank you all for looking at this.  It's unlikely that anyone will ever notice 
the improvement, but I'm happy with it and that's all the matters ;-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Raymond Hettinger



Change by Raymond Hettinger :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Steven D'Aprano



Steven D'Aprano  added the comment:

Raymond, Mark, Tim,

I have been reading this whole thread. Thank you all. I am in awe and a little 
bit intimidated by how much I still have to learn about floating point 
arithmetic.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Tim Peters



Tim Peters  added the comment:

But I would like to leave it alone. Extended precision simply is not an issue 
on any current platform I'm aware of ("not even Windows"), and I would, e.g., 
hate trying to explain to users why

1 / 2731 != 1.0 / 2731.0

(assuming we're not also proposing to take float division away from the HW). 
It's A Feature that

I / J == float(I) / float(J)

whenever I and J are both representable as floats.

If extended precision is an issue on some platform, fine, let them speak up. On 
x87 we could document that CPython assumes the FPU's "precision control" is set 
to 53 bits.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

> It won't affect _this_ application, but possibly we should 
> fix this anyway.

I would like to see this fixed.  It affects our ability to
reason about int/int code.  That comes up every time a 
fraction is fed into a math library function than converts
its input to a float.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Mark Dickinson



Mark Dickinson  added the comment:

Concrete example of int/int not being correctly rounded on systems using x87 
instructions: on those systems, I'd expect to see 1/2731 return a result of 
0.00036616623947272064 (0x1.7ff4005ffd002p-12), as a result of first rounding 
to 64-bit precision and then to 53-bit. The correctly-rounded result is 
0.0003661662394727206 (0x1.7ff4005ffd001p-12).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Mark Dickinson



Mark Dickinson  added the comment:

> All the rounding has already happened at the point where ldexp is called, and 
> the result of the ldexp call is exact.

Sketch of proof:

[Here](https://github.com/python/cpython/blob/4ebde73b8e416eeb1fd5d2ca3283f7ddb534c5b1/Objects/longobject.c#L3929)
 we have:

shift = Py_MAX(diff, DBL_MIN_EXP) - DBL_MANT_DIG - 2;

from which (assuming IEEE 754 as usual) shift >= -1076. (DBL_MIN_EXP = -1021, 
DBL_MANT_DIG = 53)

[Here](https://github.com/python/cpython/blob/4ebde73b8e416eeb1fd5d2ca3283f7ddb534c5b1/Objects/longobject.c#L4008)
 we round away the last two or three bits of x, after which x is guaranteed to 
be a multiple of 4:

x->ob_digit[0] = low & ~(2U*mask-1U);

Then after converting the PyLong x to a double dx with exactly the same value 
[here](https://github.com/python/cpython/blob/4ebde73b8e416eeb1fd5d2ca3283f7ddb534c5b1/Objects/longobject.c#L4020)
 we make the ldexp call:

result = ldexp(dx, (int)shift);

At this point dx is a multiple of 4 and shift >= -1076, so the result of the 
ldexp scaling is a multiple of 2**-1074, and in the case of a subnormal result, 
it's already exactly representable.


For the int/int division possibly not being correctly rounded on x87, see 
[here](https://github.com/python/cpython/blob/4ebde73b8e416eeb1fd5d2ca3283f7ddb534c5b1/Objects/longobject.c#L3889-L3892).

It won't affect _this_ application, but possibly we should fix this anyway. 
Though the progression of time is already effectively fixing it for us, as x87 
becomes less and less relevant.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Tim Peters



Tim Peters  added the comment:

> Objects/longobject.c::long_true_divide() uses ldexp() internally.
> Will it suffer the same issues with subnormals on Windows?

Doesn't look like it will. In context, looks like it's ensuring that ldexp can 
only lose trailing 0 bits, so that _whatever_ ldexp does in the way of rounding 
is irrelevant. But it's not doing this because of Windows - it's to prevent 
"double-rounding" errors regardless of platform.

> Is CPython int/int true division guaranteed to be correctly rounded?

If there's some promise of that in the docs, I don't know where it is. But the 
code clearly intends to strive for correct rounding. Ironically, PEP 238 
guarantees that if it is correctly rounded, that's purely by accident ;-) :

"""
True division for ints and longs will convert the arguments to float and then 
apply a float division. That is, even 2/1 will return a float 
"""

But i/j is emphatically not implemented via float(i)/float(j).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Mark Dickinson



Mark Dickinson  added the comment:

> Will it suffer the same issues with subnormals on Windows?

No, it should be fine. All the rounding has already happened at the point where 
ldexp is called, and the result of the ldexp call is exact.

> Is CPython int/int true division guaranteed to be correctly rounded?

Funny you should ask. :-) There's certainly no documented guarantee, and there 
_is_ a case (documented in comments) where the current code may not return 
correctly rounded results on machines that use x87: there's a fast path where 
both numerator and denominator fit into an IEEE 754 double without rounding, 
and we then do a floating-point division.

But we can't hit that case with the proposed code, since the numerator will 
always have at least 55 bits, so the fast path is never taken.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

[Tim]
> Note that, on Windows, ldexp() in the presence of 
> denorms can truncate. Division rounds, so
>
>assert x / 2**i == ldexp(x, -i)
>
> can fail.

Objects/longobject.c::long_true_divide() uses ldexp() internally.  Will it 
suffer the same issues with subnormals on Windows?  Is CPython int/int true 
division guaranteed to be correctly rounded?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Tim Peters



Tim Peters  added the comment:

Mark, ya, MS's Visual Studio's ldexp() has, far as I know, always worked this 
way. The code I showed was run under the 2019 edition, which we use to build 
the Windows CPython.

Raymond,

x = float(i)

is screamingly obvious at first glance.

x = i/1

looks like a coding error at first. The "reason" for different spellings in 
different branches looked obvious in the original: one branch needs to divide, 
and the other doesn't. So the original code was materially clearer to me.

Both, not sure it helps, but this use of round-to-odd appears akin to the 
decimal module's ROUND_05UP, which rounds an operation result in such a way 
that, if it's rounded again - under any rounding mode - to a narrower 
precision, you get the same narrower result as if you had used that rounding 
mode on the original operation to that narrower precision to begin with.

Decimal only needs to adjust the value of the last retained digit to, 
effectively, "encode" all possibilities, but binary needs two trailing bits. 
"Round" and "sticky" are great names for them :-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Mark Dickinson



Mark Dickinson  added the comment:

Since I've failed to find a coherent statement and proof of the general 
principle I articulated anywhere online, I've included one below. (To be 
absolutely clear, the principle is not new - it's very well known, but oddly 
hard to find written down anywhere.)

--

Setup: introducing jots
===

*Notation.* R is the set of real numbers, Z is the set of integers, operators
like *, / and ^ have their mathematical interpretations, not Python ones.

Fix a precision p > 0 and an IEEE 754 binary floating-point format with
precision p; write F for the set of representable values in that format,
including zeros and infinities. (We don't need NaNs, but feel free to include
them if you want to.)

Let rnd : R -> F be the rounding function corresponding to any of the standard
IEEE 754 rounding modes. We're not ignoring overflow and underflow here: rnd is
assumed to round tiny values to +/-0 and large values to +/-infinity as normal.
(We only really care about round-ties-to-even, but all of the below is
perfectly general.)

*Definition.* For the given fixed precision p, a *jot* is a subinterval of the
positive reals of the form (m 2^e, (m+1) 2^e) for some integers m and e, with
m satisfying 2^p <= m < 2^(p+1).

This is a definition-of-convenience, invented purely for the purposes of this
proof. (And yes, the name is silly. Suggestions for a better name to replace
"jot" are welcome. Naming things is hard.)

We've chosen the size of a jot so that between each consecutive pair a and b of
positive normal floats in F, there are exactly two jots: one spanning from a to
the midpoint (a+b)/2, and another spanning from (a+b)/2 to b. (Since jots are
open, the midpoint itself and the floats a and b don't belong to any jot.)

Now here's the key point: for values that aren't exactly representable and
aren't perfect midpoints, the standard rounding modes, whether directed or
round-to-nearest, only ever care about which side of the midpoint the value to
be rounded lies. In other words:

*Observation.* If x and y belong to the same jot, then rnd(x) = rnd(y).

This is the point of jots: they represent the wiggle-room that we have to
perturb a real number without affecting the way that it rounds under any
of the standard rounding modes.

*Note.* Between any two consecutive *subnormal* values, we have 4 or more
jots, and above the maximum representable float we have infinitely many, but
the observation that rnd is constant on jots remains true at both ends of the
spectrum. Also note that jots, as defined above, don't cover the negative
reals, but we don't need them to for what follows.

Here's a lemma that we'll need shortly.

*Lemma.* Suppose that I is an open interval of the form (m 2^e, (m+1) 2^e)
for some integers m and e satisfying 2^p <= m. Then I is either a jot, or a
subinterval of a jot.

*Proof.* If m < 2^(p+1) then this is immediate from the definition. In
the general case, m satisfies 2^q <= m < 2^(q+1) for some integer q with
p <= q. Write n = floor(m / 2^(q-p)). Then:

n <= m / 2^(q-p) < n + 1, so
n * 2^(q-p) <= m < (n + 1) * 2^(q-p), so
n * 2^(q-p) <= m and m + 1 <= (n + 1) * 2^(q-p)

so

   n * 2^(e+q-p) <= m * 2^e and (m + 1) * 2^e <= (n + 1) * 2^(e+q-p)

So I is a subinterval of (n * 2^(e+q-p), (n+1) * 2^(e+q-p)), which is a jot.


The magic of round-to-odd
=

*Definition.* The function to-odd : R -> Z is defined by:

- to-odd(x) = x if x is an integer
- to-odd(x) = floor(x) if x is not an integer and floor(x) is odd
- to-odd(x) = ceil(x) if x is not an integer and floor(x) is even

*Properties.* Some easy monotonicity properties of to-odd, with proofs left
to the reader:

- If x < 2n for real x and integer n, then to-odd(x) < to-odd(2n)
- If 2n < x for real x and integer n, then to-odd(2n) < to-odd(x)

Here's a restatement of the main principle.

*Proposition.* With p and rnd as in the previous section, suppose that x is a
positive real number and that e is any integer satisfying 2^e <= x. Define a
real number y by:

y = 2^(e-p-1) to-odd(x / 2^(e-p-1))

Then rnd(y) = rnd(x).


Proof of the principle
==

In a nutshell, we show that either

- y = x, or
- x and y belong to the same jot

Either way, since rnd is constant on jots, we get rnd(y) = rnd(x).

Case 1: x = m * 2^(e-p) for some integer m. Then x / 2^(e-p-1) = 2m is
an (even) integer, so to-odd(x / 2^(e-p-1)) = (x / 2^(e-p-1)) and y = x.
Hence rnd(y) = rnd(x).

Case 2: m * 2^(e-p) < x < (m + 1) * 2^(e-p) for some integer m.

Then rearranging, 2m < x / 2^(e-p-1) < 2(m+1). So from the monotonicity
properties of to-odd we have:

   2m < to-odd(x / 2^(e-p-1)) < 2(m+1)

And multiplying through by 2^(e-p-1) we get

   m * 2^(e-p) < y < (m+1) * 2^(e-p).

So both x and y belong to the interval I = (m*2^(e-p), (m+1)*2^(e-p-1)).

Furthermore, 2^e <= x < (m + 1) * 2^(e-p), so 2^p < m + 1, and 2^p <= m.
So by the lemma above, I is either a

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Mark Dickinson



Mark Dickinson  added the comment:

[Raymond]
> [...] perhaps do an int/int division to match the other code path [...]

Sure, works for me.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

Instead of calling float(), perhaps do an int/int division to match the other 
code path so that the function depends on only one mechanism for building the 
float result.

-return float(_isqrt_frac_rto(n, m << 2 * q) << q)
+(_isqrt_frac_rto(n, m << 2 * q) << q) / 1

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-26 Thread Mark Dickinson



Mark Dickinson  added the comment:

Related: 
https://stackoverflow.com/questions/32150888/should-ldexp-round-correctly

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-25 Thread Mark Dickinson



Mark Dickinson  added the comment:

There's also a potential double-rounding issue with ldexp, when the first 
argument is an int: ldexp(n, e) will first round n to a float, and then (again 
for results in the subnormal range) potentially also need to round the result.

>>> n = 2**53 + 1
>>> e = -1128
>>> math.ldexp(n, e)
0.0
>>> n / (1 << -e)
5e-324

I'm a bit (but only a bit) surprised and disappointed by the Windows issue; 
thanks, Tim. It seems to be okay on Mac (Intel, macOS 11.6.1):

>>> import math
>>> d = math.nextafter(0.0, 1.0)
>>> d
5e-324
>>> d3 = 7 * d
>>> d3
3.5e-323
>>> d3 / 4.0
1e-323
>>> math.ldexp(d3, -2)
1e-323

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-25 Thread Tim Peters



Tim Peters  added the comment:

Note that, on Windows, ldexp() in the presence of denorms can truncate. 
Division rounds, so

assert x / 2**i == ldexp(x, -i)

can fail.

>>> import math
>>> d = math.nextafter(0.0, 1.0)
>>> d
5e-324
>>> d3 = 7 * d # ....0111
>>> d3
3.5e-323
>>> d3 / 4.0   # rounds
1e-323
>>> math.ldexp(d3, -2)  # truncates
5e-324

or, perhaps more dramatically,

>>> d3 / 8.0, math.ldexp(d3, -3)
(5e-324, 0.0)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-25 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

Mark, would it preferable to use ldexp() to build the float?

+   return math.ldexp(isqrt_frac_rto(n << -2 * q, m), q)
-   return isqrt_frac_rto(n << -2 * q, m) / (1 << -q)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-24 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

> Here's a reference for this use of round-to-odd: 
> https://www.lri.fr/~melquion/doc/05-imacs17_1-expose.pdf

Thanks Mark.  It looks like I'll be getting a little education over the 
Thanksgiving holiday :-)

Shown below is the code that I'm thinking of using to test for correct 
rounding.  Is this the right way to do it?

# Verify correct rounding.  Find exact values for half the distance
# to the two adjacent representable floats.  The unrounded function
# input should fall between the exact squares of those values.

for i in range(10_000_000):
numerator: int = randrange(10 ** randrange(40)) + 1
denonimator: int = randrange(10 ** randrange(40)) + 1
x: Fraction = Fraction(numerator, denonimator)

root: float = sqrt_frac(numerator, denonimator)

r_up: float = math.nextafter(root, math.inf)
half_way_up: Fraction = (Fraction(root) + Fraction(r_up)) / 2
half_way_up_squared: Fraction = half_way_up ** 2

r_down: float = math.nextafter(root, -math.inf)
half_way_down: Fraction = (Fraction(root) + Fraction(r_down)) / 2
half_way_down_squared: Fraction = half_way_down ** 2

assert r_down < root < r_up
assert half_way_down_squared <= x <= half_way_up_squared

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-24 Thread Mark Dickinson



Mark Dickinson  added the comment:

Here's a reference for this use of round-to-odd: 
https://www.lri.fr/~melquion/doc/05-imacs17_1-expose.pdf

I'm happy to provide any proofs that anyone feels are needed.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-24 Thread Mark Dickinson


Mark Dickinson  added the comment:

> am still studying the new one

Sorry - I should have added at least _some_ comments to it.

Here's the underlying principle, which I think of as "The magic of 
round-to-odd":

Suppose x is a positive real number and e is an integer satisfying 2^e <= x. 
Then assuming IEEE 754 binary64 floating-point format, the quantity:

2^(e-54) * to-odd(x / 2^(e-54))

rounds to the same value as x does under _any_ of the standard IEEE 754 
rounding modes, including the round-ties-to-even rounding mode that we care 
about here.

Here, to-odd is the function R -> Z that maps each integer to itself, but maps 
each non-integral real number x to the *odd* integer nearest x. (So for example 
all of 2.1, 2.3, 2.9, 3.0, 3.1, 3.9 map to 3.)

This works because (x / 2^(e-54)) gives us an integer with at least 55 bits: 
the 53 bits we'll need in the eventual significand, a rounding bit, and then 
the to-odd supplies a "sticky" bit that records inexactness. Note that the 
principle works in the subnormal range too - no additional tricks are needed 
for that. In that case we just end up wastefully computing a few more bits than 
we actually _need_ to determine the correctly-rounded value.

The code applies this principle to the case x = sqrt(n/m) and
e = (n.bit_length() - m.bit_length() - 1)//2. The condition 2^e <= x holds 
because:

2^(n.bit_length() - 1) <= n
m < 2^m.bit_length()
 
so

2^(n.bit_length() - 1 - m.bit_length()) < n/m

and taking square roots gives

2^((n.bit_length() - 1 - m.bit_length())/2) < √(n/m)

so taking e = (n.bit_length() - 1 - m.bit_length())//2, we have

2^e <= 2^((n.bit_length() - 1 - m.bit_length())/2) < √(n/m)

Now putting q = e - 54, we need to compute

2^q * round-to-odd(√(n/m) / 2^q)

rounded to a float. The two branches both do this computation, by moving 2^q 
into either the numerator or denominator of the fraction as appropriate 
depending on the sign of q.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

As a side effect of inlining the variance code, we also get to fix the error 
messages which were variance specific.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

I've opened a PR to make this easy to experiment with.

It also worked with my frac_sqrt() and deci_sqrt(), but having all integer 
arithmetic and always correct rounding are nice wins.  

The only downside is that I completely understood the first two variants but am 
still studying the new one.  Perhaps Tim will have a look as well.

--
nosy: +tim.peters

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Raymond Hettinger



Change by Raymond Hettinger :


--
keywords: +patch
pull_requests: +27974
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/29736

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Mark Dickinson



Mark Dickinson  added the comment:

Hmm. isqrt_frac_rto is unnecessarily complicated. Here's a more streamlined 
version of the code.


import math

def isqrt_frac_rto(n, m):
"""
Square root of n/m, rounded to the nearest integer using round-to-odd.
"""
a = math.isqrt(n*m) // m
return a | (a*a*m != n)

def sqrt_frac(n, m):
"""
Square root of n/m as a float, correctly rounded.
"""
q = (n.bit_length() - m.bit_length() - 109) // 2
if q >= 0:
return float(isqrt_frac_rto(n, m << 2 * q) << q)
else:
return isqrt_frac_rto(n << -2 * q, m) / (1 << -q)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Mark Dickinson



Mark Dickinson  added the comment:

> Should the last line of sqrt_frac() be wrapped with float()?

It's already a float - it's the result of an int / int division.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

Should the last line of sqrt_frac() be wrapped with float()?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Mark Dickinson



Mark Dickinson  added the comment:

Here's the float-and-Fraction-based code that I'm using to compare the 
integer-based code against:

def sqrt_frac2(n, m):
"""
Square root of n/m as a float, correctly rounded.
"""
f = fractions.Fraction(n, m)

# First approximation.
x = math.sqrt(n / m)

# Use the approximation to find a pair of floats bracketing the actual sqrt
if fractions.Fraction(x)**2 >= f:
x_lo, x_hi = math.nextafter(x, 0.0), x
else:
x_lo, x_hi = x, math.nextafter(x, math.inf)

# Check the bracketing. If math.sqrt is correctly rounded (as it will be on 
a
# typical machine), then the assert can't fail. But we can't rely on 
math.sqrt being
# correctly rounded in general, so would need some fallback.
fx_lo, fx_hi = fractions.Fraction(x_lo), fractions.Fraction(x_hi)
assert fx_lo**2 <= f <= fx_hi**2

# Compare true square root with the value halfway between the two floats.
mid = (fx_lo + fx_hi) / 2
if mid**2 < f:
return x_hi
elif mid**2 > f:
return x_lo
else:
# Tricky case: mid**2 == f, so we need to choose the "even" endpoint.
# Cheap trick: the addition in 0.5 * (x_lo + x_hi) will round to even.
return 0.5 * (x_lo + x_hi)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Mark Dickinson



Mark Dickinson  added the comment:

> Does the technique you had in mind involve testing 1 ulp up or down to see 
> whether its square is closer to the input?

Kinda sorta. Below is some code: it's essentially just pure integer operations, 
with a last minute conversion to float (implicit in the division in the case of 
the second branch). And it would need to be better tested, documented, and 
double-checked to be viable.


def isqrt_rto(n):
"""
Square root of n, rounded to the nearest integer using round-to-odd.
"""
a = math.isqrt(n)
return a | (a*a != n)


def isqrt_frac_rto(n, m):
"""
Square root of n/m, rounded to the nearest integer using round-to-odd.
"""
quotient, remainder = divmod(isqrt_rto(4*n*m), 2*m)
return quotient | bool(remainder)


def sqrt_frac(n, m):
"""
Square root of n/m as a float, correctly rounded.
"""
quantum = (n.bit_length() - m.bit_length() - 1) // 2 - 54
if quantum >= 0:
return float(isqrt_frac_rto(n, m << 2 * quantum) << quantum)
else:
return isqrt_frac_rto(n << -2 * quantum, m) / (1 << -quantum)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

> It wouldn't be hard to go for _always_ correctly rounded 
> and actually get it over.

Yes, that would be the right thing to do.

Does the technique you had in mind involve testing 1 ulp up or down to see 
whether its square is closer to the input?


> an Emax of 999 and an Emin of -999 would already be 
> more than big enough.

Right. In haste, I confused max_exp=1024 with max_10_exp=308.  

Am still thinking that the precision needs to be bumped up a bit the 28 place 
default.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Mark Dickinson



Mark Dickinson  added the comment:

> we've run the ball almost the full length of the field and then failed to put 
> the ball over the goal line

But if we only go from "faithfully rounded" to "almost always correctly 
rounded", it seems to me that we're still a couple of millimetres away from 
that goal line. It wouldn't be hard to go for _always_ correctly rounded and 
actually get it over.

> Yes, the Emin and Emax for the default context is already almost big enough

I'm confused: big enough for what? I was thinking of the use-case where the 
inputs are all floats, in which case an Emax of 999 and an Emin of -999 would 
already be more than big enough.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Raymond Hettinger



Raymond Hettinger  added the comment:

> I'm not sure this is worth worrying about ...

Instead of writing simple, mostly accurate code with math.fsum(), these 
functions have already applied labor intensive measures to get an exact mean 
and exact sum of square differences expressed in fractions.  Then at the final 
step before the square root, it prematurely rounds to a float.  This is easy to 
fix and has a single step cost comparable to that already paid for each input 
datum.

In a sports analogy, we've run the ball almost the full length of the field and 
then failed to put the ball over the goal line.

Part of the module's value proposition is that it strives to be more accurate 
than obvious implementations.  The mean, variance, and pvariance function are 
correctly rounded.  In this regard, the stdev and pstdev functions are 
deficient, but they could easily be made to be almost always correctly rounded.

> One thought: would the deci_sqrt approach help with value ranges 
> where the values are well within float limits, but the squares
> of the values are not?

Yes, the Emin and Emax for the default context is already almost big enough:

Context(prec=28, rounding=ROUND_HALF_EVEN, Emin=-99, 
Emax=99, capitals=1, clamp=0, flags=[], 
traps=[InvalidOperation, DivisionByZero, Overflow])

We could bump that up to fully envelop operations on the sum of squares:

Context(Emin=-9_999_999, Emax=9_999_999, prec=50)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Mark Dickinson



Mark Dickinson  added the comment:

One thought: would the deci_sqrt approach help with value ranges where the 
values are well within float limits, but the squares of the values are not? 
E.g., on my machine, I currently get errors for both of the following:

>>> xs = [random.normalvariate(0.0, 1e200) for _ in range(10**6)]
>>> statistics.stdev(xs)

>>> xs = [random.normalvariate(0.0, 1e-200) for _ in range(10**6)]
>>> statistics.stdev(xs)

It's hard to imagine that there are too many use-cases for values of this size, 
but it still feels a bit odd to be constrained to only half of the dynamic 
range of float.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Mark Dickinson


Mark Dickinson  added the comment:

I'm not sure this is worth worrying about. We already have a very tight error 
bound on the result: if `x` is a (positive) fraction and `y` is the closest 
float to x, (and assuming IEEE 754 binary64, round-ties-to-even, no overflow or 
underflow, etc.) then `math.sqrt(y)` will be in error by strictly less than 1 
ulp from the true value √x, so we're already faithfully rounded. (And in 
particular, if the std. dev. is exactly representable as a float, this 
guarantees that we'll get that standard deviation exactly.)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Mark Dickinson



Change by Mark Dickinson :


--
nosy: +mark.dickinson

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45876] Improve accuracy of stdev functions in statistics

2021-11-23 Thread Raymond Hettinger



New submission from Raymond Hettinger :

The standard deviation computation in the statistics module is still subject to 
error even though the mean and sum of square differences are computed exactly 
using fractions.

The problem is that the exact fraction gets rounded to a float before going 
into math.sqrt() which also rounds.

It would be better to compute a high precision square root directly from the 
exact fraction rather than from a fraction rounded to a float.

Here are two ways to do it.  With Cpython, the second way is twice as fast.   
With PyPy3, the speed is about the same. 


def frac_sqrt(x: Fraction) -> float:
'Return sqrt to greater precision than math.sqrt()'
# Needed because a correctly rounded square root of
# a correctly rounded input can still be off by 1 ulp.
# Here we avoid the initial rounding and work directly
# will the exact fractional input.  The square root
# is first approximated with math.sqrt() and then
# refined with a divide-and-average step. Since the
# Newton-Raphson algorithm has quadratic convergence,
# one refinement step gives excellent accuracy.
a = Fraction(math.sqrt(x))
return float((x / a + a) / 2)


def deci_sqrt(x: Fraction) -> float:
ratio = Decimal(x.numerator) / Decimal(x.denominator)
return float(ratio.sqrt())

--
assignee: rhettinger
components: Library (Lib)
messages: 406824
nosy: rhettinger, steven.daprano
priority: normal
severity: normal
status: open
title: Improve accuracy of stdev functions in statistics
type: enhancement
versions: Python 3.11

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

41 matches

Mail list logo