[issue20479] Efficiently support weight/frequency mappings in the statistics module

2019-01-21 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

Here is some further information on weights in statistics in general, 
and SAS and Stata specifically:

https://blogs.sas.com/content/iml/2017/10/02/weight-variables-in-statistics-sas.html

Quote:

use the FREQ statement to specify integer frequencies for 
repeated observations. Use the WEIGHT statement when you 
want to decrease the influence that certain observations 
have on the parameter estimates. 

http://support.sas.com/kb/22/600.html

https://www.stata.com/manuals13/u20.pdf#u20.23

Executive summary:

- Stata defines four different kinds of weights;

- SAS defines two, WEIGHT and FREQ (frequency);

- SAS truncates FREQ values to integers, with zero or 
  negative meaning that the data point is to be ignored;

- Using FREQ is equivalent to repeating the data points.
  In Python terms:

  mean([1, 2, 3, 4], freq=[1, 0, 3, 1])

  would be equivalent to mean([1, 3, 3, 3, 4]).

- Weights in SAS are implicitly normalised to sum to 1, 
  but some functions allow you to normalise to sum to the
  number of data points, because it sometimes makes a 
  difference.

- It isn't clear to me what the physical meaning of weights
  in SAS actually is. The documentation is unclear; it *could*
  be as simple as the definition of weighted mean here:

https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Mathematical_definition

but how that extends to more complex SAS functions is unclear to me.
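To make the FREQ semantics above concrete, here is a rough Python sketch of the expansion rule as I read the SAS docs (`expand_freq` is my own name, not a proposed API):

```python
from math import trunc

def expand_freq(data, freq):
    # SAS-style FREQ handling: truncate each frequency to an integer;
    # zero or negative frequencies drop the data point entirely.
    out = []
    for x, f in zip(data, freq):
        f = trunc(f)
        if f > 0:
            out.extend([x] * f)
    return out

# mean([1, 2, 3, 4], freq=[1, 0, 3, 1]) == mean([1, 3, 3, 3, 4])
assert expand_freq([1, 2, 3, 4], [1, 0, 3, 1]) == [1, 3, 3, 3, 4]
```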

(And for what it's worth, I don't think SAS's MEAN function supports 
weights at all. Any SAS users here who could comment?)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20479] Efficiently support weight/frequency mappings in the statistics module

2019-01-20 Thread Oscar Benjamin


Oscar Benjamin  added the comment:

Sorry, sent too soon...

> Matlab doesn't support even weighted mean as far as I can tell. There
> is wmean on the matlab file exchange:
https://stackoverflow.com/a/36464881/9450991

This is a separate function `wmean(data, weights)`. It has to be a
separate function, though, because it's third-party code, so the author
couldn't change the main mean function.

R ships with a weighted.mean function but I think for standard
deviation you need third party libs.

This was only a quick survey, but the main impression I get is that
providing an API for this is not that common. The only good-looking API
is the statsmodels one.

--




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2019-01-20 Thread Oscar Benjamin


Oscar Benjamin  added the comment:

> I would find it very helpful if somebody has time to do a survey of
> other statistics libraries or languages (e.g. numpy, R, Octave, Matlab,
> SAS etc) and see how they handle data with weights.

Numpy has only sporadic support for this. The standard mean function
does not have any way to provide weights but there is an alternative
called average that computes the mean and has an optional weights
argument. I've never heard of average before searching for "numpy
weighted mean" just now. Numpy's API often has bits of old cruft from
where various numerical packages were joined together so I'm not sure
they would recommend their current approach. I don't think there are
any other numpy functions for providing weighted statistics.

Statsmodels does provide an API for this as explained here:
https://stackoverflow.com/a/36464881/9450991
Their API is that you create an object with data and weights and can
then call methods/attributes for statistics.

Matlab doesn't support even weighted mean as far as I can tell. There
is wmean on the matlab file exchange:

>
> - what APIs do they provide?
> - do they require weights to be positive integers, or do they
>   support arbitrary float weights?
> - including negative weights?
>   (what physical meaning does a negative weight have?)
>
> At the moment, a simple helper function seems to do the trick for
> non-negative integer weights:
>
> def flatten(items):
>     for item in items:
>         yield from item
>
> py> data = [1, 2, 3, 4]
> py> weights = [1, 4, 1, 2]
> py> statistics.mean(flatten([x]*w for x, w in zip(data, weights)))
> 2.5
>
> In principle, the implementation could be as simple as a single
> recursive call:
>
> def mean(data, weights=None):
>     if weights is not None:
>         return mean(flatten([x]*w for x, w in zip(data, weights)))
>     # base case without weights is unchanged
>
> or perhaps it could be just a recipe in the docs.
>

--




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2019-01-18 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

> Is this proposal still relevant?

Yes.

As Raymond says, deciding on a good API is the hard part. It's relatively 
simple to swap a poor implementation for a better one, but backwards 
compatibility means that changing the API is very difficult.

I would find it very helpful if somebody has time to do a survey of 
other statistics libraries or languages (e.g. numpy, R, Octave, Matlab, 
SAS etc) and see how they handle data with weights.

- what APIs do they provide?
- do they require weights to be positive integers, or do they 
  support arbitrary float weights?
- including negative weights? 
  (what physical meaning does a negative weight have?)

At the moment, a simple helper function seems to do the trick for 
non-negative integer weights:

def flatten(items):
    for item in items:
        yield from item

py> data = [1, 2, 3, 4]
py> weights = [1, 4, 1, 2]
py> statistics.mean(flatten([x]*w for x, w in zip(data, weights)))
2.5

In principle, the implementation could be as simple as a single 
recursive call:

def mean(data, weights=None):
    if weights is not None:
        return mean(flatten([x]*w for x, w in zip(data, weights)))
    # base case without weights is unchanged

or perhaps it could be just a recipe in the docs.
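The recipe above only works for non-negative integer weights. For arbitrary non-negative weights, a direct weighted sum is needed instead. A minimal sketch (not a proposed implementation, and skipping the NaN handling and type coercion the real module takes care of):

```python
from fractions import Fraction

def weighted_mean(data, weights):
    # sum(w*x) / sum(w), computed exactly via Fraction
    total = wsum = Fraction(0)
    for x, w in zip(data, weights):
        total += Fraction(x) * Fraction(w)
        wsum += Fraction(w)
    if wsum <= 0:
        raise ValueError('sum of weights must be positive')
    return float(total / wsum)

# Agrees with the integer-weight recipe above:
assert weighted_mean([1, 2, 3, 4], [1, 4, 1, 2]) == 2.5
# ...but also accepts fractional weights:
assert weighted_mean([1, 2], [0.5, 1.5]) == 1.75
```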

--




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2019-01-18 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> Is this proposal still relevant? If so, I would 
> like to work on its implementation.

The first question is the important one.  Writing implementations is usually 
the easy part.  Deciding on whether there is a real need, and creating a usable, 
extendable, and clean API, is often the hard part.  So please don't rush to a PR; 
that just complicates the important part.

--




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2019-01-18 Thread Rémi Lapeyre

Rémi Lapeyre  added the comment:

Is this proposal still relevant? If so, I would like to work on its 
implementation.

I think the third proposition, changing the API to take a new `weights` 
parameter, is the best, as it does not blindly assume that a tuple is a 
(value, weight) pair, which may not always be true.

--
nosy: +remi.lapeyre

--
nosy: +remi.lapeyre



[issue20479] Efficiently support weight/frequency mappings in the statistics module

2019-01-18 Thread Rémi Lapeyre

Change by Rémi Lapeyre :


--
versions: +Python 3.8 -Python 3.7




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2017-10-28 Thread Nick Coghlan

Nick Coghlan  added the comment:

Thinking back to my signal processing days, I have to agree that our weightings 
(filter definitions) were usually separate from our data (live signals). 
Similarly, systems engineering trade studies all maintained feature weights 
separately from the assessments of the individual options.

The comment from my original RFE about avoiding expanding value -> 
weight/frequency mappings "to a full iterable all the time" doesn't actually 
make much sense in 3.x, since m.keys() and m.values() are now typically able to 
avoid data copying.

So +1 from me for the separate *weights* parameter, with the 
m.keys()/m.values() idiom used to handle mappings like Counter.

As another point in favour of that approach, it's trivial to build zero-copy 
weighted variants on top of it for mappings with cheap key and value views:

def weighted_mean(mapping):
    return statistics.mean(mapping.keys(), mapping.values())

By contrast, if the lowest level primitive provided is a mapping based API, 
then when you do have separate values-and-weights iterables, you're going to 
have a much harder time avoiding building a completely new container.

--




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2017-10-28 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

My recommendation is to have *weights* as an optional argument:

statistics.mean(values, weights=None)

While it is tempting to special case dicts and counters, I got feedback from 
Jake Vanderplas and Wes McKinney that in practice it is more common to have the 
weights as a separate list/array/vector.

That API has other advantages as well.  For starters, it is a simple extension 
of the existing API, so it isn't a disruptive change.  Also, it works well with 
mapping views: 
   
   statistics.mean(vehicle_sales.keys(), vehicle_sales.values())

And the API also helps support use cases where different weightings are being 
explored for the same population:

   statistics.mean(salary, years_of_service)
   statistics.mean(salary, education)
   statistics.mean(salary, age)
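A toy model of that API (the `weights` parameter here is hypothetical; statistics.mean has no such argument today):

```python
from collections import Counter
from fractions import Fraction

def mean(data, weights=None):
    # Sketch of the proposed signature: unweighted by default,
    # weighted arithmetic mean when weights are given.
    if weights is None:
        data = list(data)
        return sum(data) / len(data)
    num = den = Fraction(0)
    for x, w in zip(data, weights):
        num += Fraction(x) * Fraction(w)
        den += Fraction(w)
    return float(num / den)

# Works directly with mapping views, no expansion of the data:
vehicle_sales = Counter({20000: 3, 30000: 1})
assert mean(vehicle_sales.keys(), vehicle_sales.values()) == 22500.0
```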

--
nosy: +rhettinger
versions: +Python 3.7 -Python 3.5

--
nosy: +rhettinger
versions: +Python 3.7 -Python 3.5



[issue20479] Efficiently support weight/frequency mappings in the statistics module

2017-10-28 Thread Serhiy Storchaka

Change by Serhiy Storchaka :


--
nosy:  -serhiy.storchaka




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2014-02-03 Thread Wolfgang Maier

Wolfgang Maier added the comment:

Well, I was thinking about frequencies (ints) when suggesting

for x, m in data.items():
    T = _coerce_types(T, type(x))
    n, d = exact_ratio(x)
    partials[d] = partials_get(d, 0) + n*m

in my previous message. To support weights (float or Rational) this would have 
to be more sophisticated.

Wolfgang

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue20479
___



[issue20479] Efficiently support weight/frequency mappings in the statistics module

2014-02-03 Thread Oscar Benjamin

Oscar Benjamin added the comment:

> in my previous message. To support weights (float or Rational) this would
> have to be more sophisticated.

I guess you'd do:

for x, w in data.items():
    T = _coerce_types(T, type(x))
    xn, xd = exact_ratio(x)
    wn, wd = exact_ratio(w)
    partials[xd * wd] = partials_get(xd * wd, 0) + xn * wn

Variance is only slightly trickier. Median would be more complicated.
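Extracted into a self-contained function, the idea looks like this (a simplified sketch: `exact_ratio` here only handles ints and floats, unlike the real statistics._exact_ratio, and the type-coercion bookkeeping is omitted):

```python
from fractions import Fraction

def exact_ratio(x):
    # Simplified: ints and floats only.
    if isinstance(x, int):
        return x, 1
    return x.as_integer_ratio()

def weighted_sum(data):
    # Exact sum of w*x over a {value: weight} mapping, accumulating
    # numerators per denominator the way statistics._sum does.
    partials = {}
    for x, w in data.items():
        xn, xd = exact_ratio(x)
        wn, wd = exact_ratio(w)
        d = xd * wd
        partials[d] = partials.get(d, 0) + xn * wn
    return sum(Fraction(n, d) for d, n in partials.items())

assert weighted_sum({1: 0.5, 3: 0.25}) == Fraction(5, 4)
```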

I just think that I prefer to know when I look at code that something is being
treated as a mapping or as an iterable. So when I look at

d = f(x, y, z)
v = variance_map(d)

It's immediately obvious what d is and how the function variance_map is using
it.

As well as the benefit of readability there's also the fact that accepting
different kinds of input puts strain on any attempt to modify your code in the
future. Auditing the code requires understanding at all times that the name
data is bound to a quantum superposition of different types of object.

Either every function would have to have the same iterable or mapping
interface or there would have to be some other convention for making it clear
which ones do. Perhaps the functions that don't make sense for a mapping could
explicitly reject them rather than treating them as an iterable.

I just think it's simpler to have a different function name for each type of
input. Then it's clear what functions are available for working with mappings.

If you were going for something completely different then you could have an
object-oriented interface where there are classes for the different types of
data and methods that do the right thing in each case.

Then you would do

v = WeightedData(d).variance()

The ordinary variance() function could just become a shortcut for

def variance(data):
    return SequenceData(data).variance()
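A minimal sketch of that class-based design (the names and the choice of frequency-style weights are illustrative only, not a worked-out proposal):

```python
class WeightedData:
    # Wraps a {value: weight} mapping and exposes statistics as methods.
    def __init__(self, mapping):
        self._items = list(mapping.items())

    def mean(self):
        wsum = sum(w for _, w in self._items)
        return sum(x * w for x, w in self._items) / wsum

    def variance(self):
        # Sample variance, treating weights as integer frequencies.
        m = self.mean()
        wsum = sum(w for _, w in self._items)
        ss = sum(w * (x - m) ** 2 for x, w in self._items)
        return ss / (wsum - 1)

d = {1: 1, 3: 3}
assert WeightedData(d).mean() == 2.5       # same as mean([1, 3, 3, 3])
assert WeightedData(d).variance() == 1.0   # same as variance([1, 3, 3, 3])
```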

--




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2014-02-02 Thread Steven D'Aprano

Steven D'Aprano added the comment:

Off the top of my head, I can think of three APIs:

(1) separate functions, as Nick suggests:
mean vs weighted_mean, stdev vs weighted_stdev

(2) treat mappings as implied (value, frequency) pairs

(3) take an additional argument to switch between unweighted and weighted modes.

I dislike #3, but will consider the others for 3.5.

--
assignee:  - stevenjd




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2014-02-02 Thread Oscar Benjamin

Oscar Benjamin added the comment:

On 2 February 2014 11:55, Steven D'Aprano rep...@bugs.python.org wrote:

> (1) separate functions, as Nick suggests:
> mean vs weighted_mean, stdev vs weighted_stdev

This would be my preferred approach. It makes it very clear which
functions are available for working with map style data. It will be
clear from both the module documentation and a casual introspection of
the module that those APIs are present for those who might want them.
Also apart from mode() the implementation of each function on
map-format data will be completely different from the iterable version
so you'd want to have it as a separate function at least internally
anyway.

--




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2014-02-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

See also issue18844.

--
nosy: +serhiy.storchaka




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2014-02-02 Thread Wolfgang Maier

Wolfgang Maier added the comment:

> -----Original Message-----
> From: Steven D'Aprano [mailto:rep...@bugs.python.org]
> Sent: Sunday, 2 February 2014 12:55
> To: wolfgang.ma...@biologie.uni-freiburg.de
> Subject: [issue20479] Efficiently support weight/frequency mappings in the
> statistics module
>
>
> Steven D'Aprano added the comment:
>
> Off the top of my head, I can think of three APIs:
>
> (1) separate functions, as Nick suggests:
> mean vs weighted_mean, stdev vs weighted_stdev
>
> (2) treat mappings as implied (value, frequency) pairs
>

(2) is clearly my favourite. (1) may work well if you have a module where 
only a small fraction of functions needs an alternate API.
In the statistics module, however, almost all of the current functions could 
benefit from having a way to treat mappings specially.
In such a case, (1) is prone to create lots of redundancy.

I do not share Oscar's opinion that

> apart from mode() the implementation of each function on
> map-format data will be completely different from the iterable version
> so you'd want to have it as a separate function at least internally
> anyway.

Consider _sum's current code (docstring omitted for brevity):
def _sum(data, start=0):
    n, d = _exact_ratio(start)
    T = type(start)
    partials = {d: n}  # map {denominator: sum of numerators}
    # Micro-optimizations.
    coerce_types = _coerce_types
    exact_ratio = _exact_ratio
    partials_get = partials.get
    # Add numerators for each denominator, and track the current type.
    for x in data:
        T = _coerce_types(T, type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n
    if None in partials:
        assert issubclass(T, (float, Decimal))
        assert not math.isfinite(partials[None])
        return T(partials[None])
    total = Fraction()
    for d, n in sorted(partials.items()):
        total += Fraction(n, d)
    if issubclass(T, int):
        assert total.denominator == 1
        return T(total.numerator)
    if issubclass(T, Decimal):
        return T(total.numerator)/total.denominator
    return T(total)

all you'd have to do to treat mappings as proposed here is to add a check 
for whether we are dealing with a mapping and, in that case, instead of the 
for loop:

for x in data:
    T = _coerce_types(T, type(x))
    n, d = exact_ratio(x)
    partials[d] = partials_get(d, 0) + n

use this:

for x, m in data.items():
    T = _coerce_types(T, type(x))
    n, d = exact_ratio(x)
    partials[d] = partials_get(d, 0) + n*m

and no other changes (though I haven't tested this carefully).
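A quick sanity check of the modified loop in isolation (a simplified sketch that skips the type-coercion machinery): for integer frequencies, accumulating n*m per data point gives the same total as repeating each point m times.

```python
from fractions import Fraction

def sum_with_freq(data):
    # Accumulate n*m per (value, frequency) pair, as in the loop above.
    partials = {}
    for x, m in data.items():
        n, d = x.as_integer_ratio() if isinstance(x, float) else (x, 1)
        partials[d] = partials.get(d, 0) + n * m
    return sum(Fraction(n, d) for d, n in partials.items())

# Same result as expanding each point m times:
data = {0.5: 2, 3: 4}
expanded = [x for x, m in data.items() for _ in range(m)]
assert sum_with_freq(data) == sum(Fraction(x) for x in expanded) == 13
```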

Wolfgang

--




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2014-02-01 Thread Nick Coghlan

Changes by Nick Coghlan ncogh...@gmail.com:


--
dependencies: +Avoid inadvertently special casing Counter in statistics module
versions: +Python 3.5




[issue20479] Efficiently support weight/frequency mappings in the statistics module

2014-02-01 Thread Nick Coghlan

New submission from Nick Coghlan:

Issue 20478 suggests ensuring that even weight/frequency mappings like 
collections.Counter are consistently handled as iterables in the current 
statistics module API.

However, it likely makes sense to provide public APIs that support efficiently 
working with such weight/frequency mappings directly, rather than requiring 
that they be expanded to a full iterable all the time.

One possibility would be to provide parallel APIs with the _map suffix, similar 
to the format() vs format_map() distinction in the string formatting APIs.

--
components: Library (Lib)
messages: 209931
nosy: gregory.p.smith, ncoghlan, oscarbenjamin, stevenjd, wolma
priority: normal
severity: normal
stage: needs patch
status: open
title: Efficiently support weight/frequency mappings in the statistics module
type: enhancement
