[issue36546] Add quantiles() to the statistics module

2019-09-17 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset 31af1cce9d799475ba8c3dec3e239b4a75ce268f by Raymond Hettinger 
(Miss Islington (bot)) in branch '3.8':
bpo-36546: No longer a need to make "data" positional only (GH-16252) (GH-16253)
https://github.com/python/cpython/commit/31af1cce9d799475ba8c3dec3e239b4a75ce268f


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-09-17 Thread miss-islington


Change by miss-islington :


--
pull_requests: +15848
pull_request: https://github.com/python/cpython/pull/16253

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-09-17 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset 272d0d017aef585acf84bb0af99a90a2a8582b2c by Raymond Hettinger in 
branch 'master':
bpo-36546: No longer a need to make "data" positional only (GH-16252)
https://github.com/python/cpython/commit/272d0d017aef585acf84bb0af99a90a2a8582b2c


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-09-17 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
pull_requests: +15847
pull_request: https://github.com/python/cpython/pull/16252

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-07-21 Thread miss-islington


Change by miss-islington :


--
pull_requests: +14679
pull_request: https://github.com/python/cpython/pull/14899

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-07-19 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset e5bfd1ce9da51b64d157392e0a831637f7335ff5 by Raymond Hettinger 
(Miss Islington (bot)) in branch '3.8':
bpo-36546:  Clean-up comments (GH-14857) (#14859)
https://github.com/python/cpython/commit/e5bfd1ce9da51b64d157392e0a831637f7335ff5


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-07-19 Thread miss-islington


Change by miss-islington :


--
pull_requests: +14649
pull_request: https://github.com/python/cpython/pull/14859

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-07-19 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset eed5e9a9562d4dcd137e9f0fc7157bc3373c98cc by Raymond Hettinger in 
branch 'master':
bpo-36546:  Clean-up comments (GH-14857)
https://github.com/python/cpython/commit/eed5e9a9562d4dcd137e9f0fc7157bc3373c98cc


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-07-19 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
pull_requests: +14647
pull_request: https://github.com/python/cpython/pull/14857

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-06-24 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset 210358b25cf6425c81a341a074be6cd897c2d43d by Raymond Hettinger 
(Miss Islington (bot)) in branch '3.8':
bpo-36546: Mark first argument as position only (GH-14363) (GH-14364)
https://github.com/python/cpython/commit/210358b25cf6425c81a341a074be6cd897c2d43d


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-06-24 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset 1791128677e71f3b93cae4bbed66ac3c6e4b5110 by Raymond Hettinger in 
branch 'master':
 bpo-36546: Mark first argument as position only (GH-14363)
https://github.com/python/cpython/commit/1791128677e71f3b93cae4bbed66ac3c6e4b5110


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-06-24 Thread miss-islington


Change by miss-islington :


--
pull_requests: +14181
pull_request: https://github.com/python/cpython/pull/14364

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-06-24 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
pull_requests: +14180
pull_request: https://github.com/python/cpython/pull/14363

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-06-02 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset cba9f84725353455b0995bd47d0fa8cb1724464b by Raymond Hettinger in 
branch 'master':
bpo-36546: Add design notes to aid future discussions (GH-13769)
https://github.com/python/cpython/commit/cba9f84725353455b0995bd47d0fa8cb1724464b


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-06-02 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
pull_requests: +13653
pull_request: https://github.com/python/cpython/pull/13769

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-05-18 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset e917f2ed9af044fe808fc9b4ddc6c5eb99003500 by Raymond Hettinger in 
branch 'master':
bpo-36546: Add more tests and expand docs (#13406)
https://github.com/python/cpython/commit/e917f2ed9af044fe808fc9b4ddc6c5eb99003500


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-05-18 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
pull_requests: +13317

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-28 Thread Karthikeyan Singaravelan


Change by Karthikeyan Singaravelan :


--
pull_requests: +12922

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-28 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset db81ba1393af40ba920a996651e2c11943c3663c by Raymond Hettinger in 
branch 'master':
bpo-36546: More tests: type preservation and equal inputs (#13000)
https://github.com/python/cpython/commit/db81ba1393af40ba920a996651e2c11943c3663c


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-28 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
pull_requests: +12921

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-27 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Thanks for propelling this forward :-)  I'm really happy to have an easy to 
reach tool that readily summarizes the shape of data and that can be used to 
compare how distributions differ.


> Octave and Maple call their parameter "method", so if we 
> stick with  "method" we're in good company.

The Langford paper also uses the word "method", so that is likely just the 
right word.


> I'm more concerned about the values taken by the method parameter.
> "Inclusive" and "Exclusive"h ave a related but distinct meaning 
> when it comes to quartiles which is different from the Excel usage

Feel free to change it to whatever communicates the best.  The meaning I was 
going for is closer to the notions of open-interval or closed interval.  In 
terms of use cases, one is for describing population data where the minimum 
input really is the 0th percentile and the maximum is the 100th percentile.  
The other is for sample data where the underlying population will have values 
outside the range of the empirical samples.  I'm not sure what words bests 
describe the distinction.  The word "inclusive" and "exclusive" approximated 
that idea but maybe you can do better.


> I have a working version of quantiles() which supports cutpoints 
> and all nine calculation methods supported by R.

My recommendation is to not do this.  Usually, it's better to start simple, 
focusing on core use cases (i.e. sample and population), then let users teach 
us what additions they really need (this is a YAGNI argument).  Once a feature 
is offered, it can never be taken away even if it proves to be not helpful in 
most situations or is mostly unused.

In his 20 year retrospective, Hyndman expressed dismay that his paper had the 
opposite effect of what was intended (hoping for a standardization on a single 
approach rather than a proliferation of all nine methods).  My experience in 
API design is that offering users too many choices will complicate their lives, 
leading to suboptimal and incorrect choices and creating confusion.   That is 
likely why most software packages other than R only offer one or two options.

If you hold off, you can always add these options later.  We might just find 
that what we've got suffices for most everyday uses.   Also, I thought the 
spirit of the statistics module was to offer a few core statistical tools aimed 
at non-experts, deferring to external packages for more rich collections of 
optimized, expert tools that cover every option.  For me, the best analogy is 
my two cameras. One is a point and shoot that is easy to use and does a 
reasonable job. The other is a professional SLR with hundreds of settings that 
I had to go to photography school to learn to use.

FWIW, I held-off on adding "cut_points" because the normal use case is to get 
equally spaced quantiles.  It would be unusual to want 0.25 and 0.50 but not 
0.75.   The other reason is that user provided cut-points conflict with core 
concept of "Divide *dist* into *n* continuous intervals with equal 
probability."  User provided cut-points provide other ways to go wrong as well 
(not being sorted, 0.0 or 1.0 not being valid for some methods, values outside 
the range 0.0 to 1.0).  The need for cut_points makes more sense for numpy or 
scipy where is common to pass around a linspace. Everyday Python isn't like 
that.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-26 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

Hi Raymond,

Thanks for working on this, I'm really keen to see this happen and I 
appreciate your efforts so far.

Your arguments have also convinced me that the default calculation 
type you chose (PERCENTILE.EXC or R type=6) is suitable.

But I have a couple of concerns about the API:

1. You call the parameter to choose a calculation definition "method"; 
I'm mildly concerned that might lead to some confusion or ambiguity with 
methods on a class. Do you think I'm worrying about nothing?

R calls the same parameter "type", which I don't like either since that 
also has another meaning in Python. Numpy calls an equivalent parameter 
"interpolation", which I think is not only excessively long to type, but 
also misleading since at least two of the calculation methods used by 
numpy don't perform any interpolation at all.

Octave and Maple call their parameter "method", so if we stick with 
"method" we're in good company.

2. I'm more concerned about the values taken by the method parameter. 
"Inclusive" and "Exclusive" have a related but distinct meaning when it 
comes to quartiles which is different from the Excel usage:

Using range(1, 9) as the data:

- standard "inclusive" or "exclusive" quartiles:   2.5, 4.5, 6.5
- Excel's "exclusive" quartiles (R type 6):2.25, 4.5, 6.75
- Excel's "inclusive" quartiles (R type 7):2.75, 4.5, 6.25

I'd prefer to keep inclusive/exclusive for such quartiles.

I have a working version of quantiles() which supports cutpoints and all 
nine calculation methods supported by R. Following your lead, I've kept 
type=6 as the default. I need to tidy up the tests and inline docs, 
which time permitting I will do over the next 48 hours, and then it will 
be ready for initial review.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-23 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

As requested, I've made the API easily extendable to handle alternative methods 
if those are deemed necessary.  Also, it is easy to change the default.

That said, it is my hope that these two methods and the current default get 
kept.  They have several virtues:  easy to explain, obviousness, widely adopted 
(most of the  other packages offer at least of these two methods), and they 
have most of the desired properties listed in Hyndman & Fan.  They are a good 
practical choice.

--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-23 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset 9013ccf6d8037f6ae78145a42d194141cb10d332 by Raymond Hettinger in 
branch 'master':
bpo-36546: Add statistics.quantiles() (#12710)
https://github.com/python/cpython/commit/9013ccf6d8037f6ae78145a42d194141cb10d332


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-21 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

One other thought:  The PR implements both methods 6 and 7 which means that 
that we can reproduce the results of every major stats tool except for 
Mathematica.  The summary of properties chart of  in Hyndman & Fan lists our 
default as satisfying 5 of 6 desirable properties.  This is the same as for 
method 8 which almost no one uses by default.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-21 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Steven, as requested I added a documentation note reserving the right to add 
other interpolation methods.  We can make the note stronger if you like.  
Otherwise, I think we're good to go now. 

Mark, thanks for the link.  I've read all the posts and agree that we might 
consider changing the default interpolation method prior to the release.  For 
now, matching what Excel and SciPy does seems like a reasonable starting point.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-08 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

Thanks for taking a detailed look.  I'll explore the links you provided shortly.

The API is designed to be extendable so that we don't get trapped by the choice 
of computation method.  If needed, any or all of the following extensions are 
possible without breaking backward compatibility:

  quantiles(data, n=4, already_sorted=True) # Skip resorting
  quantiles(data, cut_points=[0.02, 0.25, 0.50, 0.75, 0.98]) # box-and-whiskers
  quantiles(data, interp_method='nearest') # also: "low", "high", "midpoint"
  quantiles(data, inclusive=True)# For description of a complete population

The default approach used in the PR matches what is used by MS Excel's 
PERCENTILE.EXC function¹.  That has several virtues. It is easy to explain.  It 
allows two unequal sized datasets to be compared (perhaps with a QQ plot) to 
explore whether they are drawn from the same distribution.  For sampled data, 
the quantiles tend to remain stable as more samples are added.  For samples 
from a known distribution (i.e normal variates), it tends to give the same 
results as ihv_cdf():

>>> iq = NormalDist(100, 15)
>>> cohort = iq.samples(10_000)
>>> for ref, est in zip(quantiles(iq, n=10), quantiles(cohort, n=10)):
... print(f'{ref:5.1f}\t{est:5.1f}')
...
 80.881.0
 87.487.8
 92.192.3
 96.296.3
100.0   100.1
103.8   104.0
107.9   108.0
112.6   112.9
119.2   119.3

My thought was to start with something like this and only add options if they 
get requested (the most likely request is an inclusive=True option to emulate 
MS Excel's PERCENTILE.INC).  

If we need to leave the exact method unguaranteed, that's fine.  But I think it 
would be better to guarantee the match to PERCENTILE.EXC and then handle other 
requests through API extensions rather than revisions.


¹ https://exceljet.net/excel-functions/excel-percentile.exc-function

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-08 Thread Mark Dickinson


Mark Dickinson  added the comment:

Related previous discussion on python-ideas: 
https://mail.python.org/pipermail/python-ideas/2018-March/049327.html

--
nosy: +mark.dickinson

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-07 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

I think adding quantiles (sometimes called fractiles) is a good feature to add. 
I especially have some use-cases for quartiles. I especially like that it 
delegates to the inv_cdf() method when available, and I'm happy with the API 
you suggested.

Forgive me if you're already aware of this, but the calculation of quantiles is 
unfortunately complicated by the fact that there are so many different ways to 
calculate them. (I see you have mentioned a potential future API for 
interp_method.)

See, for example:

http://jse.amstat.org/v14n3/langford.html

for a discussion. My own incomplete survey of statistics software has found 
about 20 distinct methods for calculating quantiles in use. I'm very happy to 
see this function added, but I'm not happy to commit to a specific calculation 
method without discussion.

If you agree that we can change the implementation later (and hence the 
specific cut points returned) then I see no reason why we can't get this in 
before feature freeze, and then do a review to find the "best" default 
implementation later. I already have three candidates:

1. Langford suggests his "CDF method 4", which is equivalent to Hyndman and 
Fan's Definition 2; it is also the default method used by SAS.

2. Hyndman and Fan themselves recommend their Definition 8:

https://robjhyndman.com/publications/quantiles/

3. R's default is H's Definition 7. (I suggest this only to ease comparisons 
with results from R, not because it has any statistical advantage.)

Do we agree that there is to be no backwards-compatibility guarantee made on 
the implementation and the specific cut-points returned? (Or at least not yet.)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-06 Thread Karthikeyan Singaravelan


Change by Karthikeyan Singaravelan :


--
nosy: +steven.daprano

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-06 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

https://patch-diff.githubusercontent.com/raw/python/cpython/pull/12710.diff

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-06 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
keywords: +patch
pull_requests: +12635
stage:  -> patch review

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36546] Add quantiles() to the statistics module

2019-04-06 Thread Raymond Hettinger


New submission from Raymond Hettinger :

It is a common and useful data analysis technique to examine quartiles, 
deciles, and percentiles. It is especially helpful for comparing distinct 
datasets (heights of boys versus heights of girls) or for comparing against a 
reference distribution (empirical data versus a normal distribution for 
example).

--- sample session ---

>>> from statistics import NormalDist, quantiles
>>> from pylab import plot

# SAT exam scores
>>> sat = NormalDist(1060, 195)
>>> list(map(round, quantiles(sat, n=4)))   # quartiles
[928, 1060, 1192]
>>> list(map(round, quantiles(sat, n=10)))  # deciles
[810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]

# Summarize a dataset
>>> data = [110, 96, 155, 87, 98, 82, 156, 88, 172, 102, 91, 184, 105, 114, 104]
>>> quantiles(data, n=2)# median
[104.0]
>>> quantiles(data, n=4)# quartiles
[91.0, 104.0, 155.0]
>>> quantiles(data, n=10)   # deciles
[85.0, 88.6, 95.0, 99.6, 104.0, 108.0, 122.2, 155.8, 176.8]


# Assess when data is normally distributed by comparing quantiles
>>> reference_dist = NormalDist.from_samples(data)
>>> quantiles(reference_dist, n=4)
[93.81594518619364, 116.27, 138.71738814713967]

# Make a QQ plot to visualize how well the data matches a normal distribution
# plot(quantiles(data, n=7), quantiles(reference_dist, n=7))

--
components: Library (Lib)
messages: 339544
nosy: rhettinger
priority: normal
severity: normal
status: open
title: Add quantiles() to the statistics module
type: enhancement
versions: Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com