Re: [Python-ideas] NAN handling in the statistics module

2019-01-10 Thread Neil Girdhar


On Monday, January 7, 2019 at 3:16:07 AM UTC-5, Steven D'Aprano wrote:
>
> (By the way, I'm not outright disagreeing with you, I'm trying to weigh 
> up the pros and cons of your position. You've given me a lot to think 
> about. More below.) 
>
> On Sun, Jan 06, 2019 at 11:31:30PM -0800, Nathaniel Smith wrote: 
> > On Sun, Jan 6, 2019 at 11:06 PM Steven D'Aprano wrote: 
> > > I'm not wedded to the idea that the default ought to be the current 
> > > behaviour. If there is a strong argument for one of the others, I'm 
> > > listening. 
> > 
> > "Errors should never pass silently"? Silently returning nonsensical 
> > results is hard to defend as a default behavior IMO :-) 
>
> If you violate the assumptions of the function, just about everything 
> can in principle return nonsensical results. True, most of the time you 
> have to work hard at it: 
>
> class MyList(list): 
>     def __len__(self): 
>         return random.randint(0, sys.maxsize) 
>
> but it isn't unreasonable to document the assumptions of a function, and 
> if the caller violates those assumptions, Garbage In Garbage Out 
> applies. 
>

I'm with Antoine, Nathaniel, David, and Chris: it is unreasonable to 
silently return nonsensical results even if you've documented it.  
Documenting it only makes it worse because it's like an "I told you so" 
when people finally figure out what's wrong and go to file the bug.
 

>
> E.g. bisect requires that your list is sorted in ascending order. If it 
> isn't, the results you get are nonsensical. 
>
> py> data = [8, 6, 4, 2, 0] 
> py> bisect.bisect(data, 1) 
> 0 
>
> That's not a bug in bisect, that's a bug in the caller's code, and it 
> isn't bisect's responsibility to fix it. 
>
> Although it could be documented better, that's the current situation 
> with NANs and median(). Data with NANs don't have a total ordering, and 
> total ordering is the unstated assumption behind the idea of a median or 
> middle value. So all bets are off. 
>
>   
> > > How would you answer those who say that the right behaviour is not to 
> > > propagate unwanted NANs, but to fail fast and raise an exception? 
> > 
> > Both seem defensible a priori, but every other mathematical operation 
> > in Python propagates NaNs instead of raising an exception. Is there 
> > something unusual about median that would justify giving it unusual 
> > behavior? 
>
> Well, not everything... 
>
> py> NAN/0 
> Traceback (most recent call last): 
>   File "<stdin>", line 1, in <module> 
> ZeroDivisionError: float division by zero 
>
>
> There may be others. But I'm not sure that "everything else does it" is 
> a strong justification. It is *a* justification, since consistency is 
> good, but consistency does not necessarily outweigh other concerns. 
>
> One possible argument for making PASS the default, even if that means 
> implementation-dependent behaviour with NANs, is that in the absence of 
> a clear preference for FAIL or RETURN, at least PASS is backwards 
> compatible. 
>
> You might shoot yourself in the foot, but at least you know it's the same 
> foot you shot yourself in using the previous version *wink* 
>
>
>
> -- 
> Steve 
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] NAN handling in the statistics module

2019-01-09 Thread Oscar Benjamin
On Wed, 9 Jan 2019 at 05:20, Steven D'Aprano  wrote:
>
> On Mon, Jan 07, 2019 at 11:27:22AM +1100, Steven D'Aprano wrote:
>
> [...]
> > I propose adding a "nan_policy" keyword-only parameter to the relevant
> > statistics functions (mean, median, variance etc), and defining the
> > following policies:
>
>
> I asked some heavy users of statistics software (not just Python users)
> what behaviour they would find useful, and as I feared, I got no
> conclusive answer. So far, the answers seem to be almost evenly split
> into four camps:
>
> - don't do anything, it is the caller's responsibility to filter NANs;
>
> - raise an immediate error;
>
> - return a NAN;
>
> - treat them as missing data.

I would prefer to raise an exception on nan. It's much easier to
debug an exception than a nan.

Take a look at the Julia docs for their statistics module:
https://docs.julialang.org/en/v1/stdlib/Statistics/index.html

In Julia they have defined an explicit "missing" value. With that you
can explicitly distinguish between a calculation error and missing
data. The obvious Python equivalent would be None.

> On consideration of all the views expressed, thank you to everyone who
> commented, I'm now inclined to default to returning a NAN (which happens
> to be the current behaviour of mean etc, but not median except by
> accident) even if it impacts performance.

Whichever way you go with this it might make sense to provide helper
functions for users to deal with nans e.g.:

xbar = mean(without_nans(data))
xbar = mode(replace_nans_with_None(data))
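
For illustration, such helpers might look like this (the names without_nans
and replace_nans_with_None are Oscar's hypothetical suggestions, not an
existing API):

```python
import math

def _isnan(x):
    # Only float NaNs are handled here; a Decimal('nan') would need
    # its own check via Decimal.is_nan().
    return isinstance(x, float) and math.isnan(x)

def without_nans(data):
    """Yield the data with any NaNs filtered out."""
    return (x for x in data if not _isnan(x))

def replace_nans_with_None(data):
    """Yield the data with each NaN replaced by None (a missing value)."""
    return (None if _isnan(x) else x for x in data)
```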

--
Oscar


Re: [Python-ideas] NAN handling in the statistics module

2019-01-09 Thread Jonathan Fine
I've just read statistics.py, and found something that might be
usefully considered along with the NaN question.

>>> median([1])
1
>>> median([1, 1])
1.0

To record this, and associated behaviour involving Fraction, I've added:
Division by 2 in statistics.median:
https://bugs.python.org/issue35698
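
For reference, a quick demonstration of the behaviour in question (current
CPython: the even-length case averages the two middle values, and the
division by 2 coerces ints to float while Fractions stay exact):

```python
import statistics
from fractions import Fraction

print(statistics.median([1]))      # 1 -- the single data point, unchanged
print(statistics.median([1, 1]))   # 1.0 -- (1 + 1) / 2 produces a float

# With Fractions, dividing by 2 preserves the type:
print(statistics.median([Fraction(1), Fraction(1)]))  # Fraction(1, 1)
```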

-- 
Jonathan


Re: [Python-ideas] NAN handling in the statistics module

2019-01-08 Thread Tim Peters
[David Mertz ]
> I think consistent NaN-poisoning would be excellent behavior.  It will
> always make sense for median (and its variants).
>
>> >>> statistics.mode([2, 2, nan, nan, nan])
>> nan
>> >>> statistics.mode([2, 2, inf - inf, inf - inf, inf - inf])
>> 2
>
>
> But in the mode case, I'm not sure we should ALWAYS treat a NaN as
> poisoning the result.

I am:  I thought about the following but didn't write about it because
it's too strained to be of actual sane use ;-)

>  If NaN means "missing value" then sometimes it could change things,
> and we shouldn't guess.  But what if it cannot?
>
> >>> statistics.mode([9, 9, 9, 9, nan1, nan2, nan3])
>
> No matter what missing value we take those nans to maybe-possibly
> represent, 9 is still the most common element.  This is only true when the
> most common thing occurs at least as often as the 2nd most common thing
> PLUS the number of all NaNs.  But in that case, 9 really is the mode.

See "too strained" above.

It's equally true that, e.g., the _median_ of your list above:

[9, 9, 9, 9, nan1, nan2, nan3]

is also 9 regardless of what values are plugged in for the nans.  That
may be easier to realize at first with a simpler list, like

[5, 5, nan]

It sounds essentially useless to me, just theoretically possible to
make a mess of implementations to cater to.

"The right" (obvious, unsurprising, useful, easy to implement, easy to
understand) non-exceptional behavior in the presence of NaNs is to
pretend they weren't in the list to begin with.  But I'd rather
people ask for that _if_ that's what they want.


Re: [Python-ideas] NAN handling in the statistics module

2019-01-08 Thread David Mertz
On Tue, Jan 8, 2019 at 11:57 PM Tim Peters  wrote:

> I'd like to see internal consistency across the central-tendency
> statistics in the presence of NaNs.  What happens now:
>

I think consistent NaN-poisoning would be excellent behavior.  It will
always make sense for median (and its variants).

> >>> statistics.mode([2, 2, nan, nan, nan])
> nan
> >>> statistics.mode([2, 2, inf - inf, inf - inf, inf - inf])
> 2
>

But in the mode case, I'm not sure we should ALWAYS treat a NaN as
poisoning the result.  If NaN means "missing value" then sometimes it could
change things, and we shouldn't guess.  But what if it cannot?

>>> statistics.mode([9, 9, 9, 9, nan1, nan2, nan3])

No matter what missing value we take those nans to maybe-possibly
represent, 9 is still the most common element.  This is only true when the
most common thing occurs at least as often as the 2nd most common thing
PLUS the number of all NaNs.  But in that case, 9 really is the mode.

We have one example of non-poisoning NaN in basic operations:

>>> nan**0
1.0

So if the NaN "cannot possibly change the answer" then it's reasonable to
produce a non-NaN answer IMO.  Except we don't really get that with 0**nan
or 0*nan already... so a NaN-poisoning mode wouldn't actually offend my
sensibilities that much. :-).

I guess you could argue that NaN "could be inf".  In that case 0*nan being
nan makes sense.  But this still feels slightly odd:

>>> 0**inf
0.0
>>> 0**nan
nan

I guess it's supported by:

>>> 0**-1
ZeroDivisionError: 0.0 cannot be raised to a negative power

A *missing value* could be a negative one.
-- 
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.


Re: [Python-ideas] NAN handling in the statistics module

2019-01-08 Thread Steven D'Aprano
On Mon, Jan 07, 2019 at 11:27:22AM +1100, Steven D'Aprano wrote:

[...]
> I propose adding a "nan_policy" keyword-only parameter to the relevant 
> statistics functions (mean, median, variance etc), and defining the 
> following policies:


I asked some heavy users of statistics software (not just Python users) 
what behaviour they would find useful, and as I feared, I got no 
conclusive answer. So far, the answers seem to be almost evenly split 
into four camps:

- don't do anything, it is the caller's responsibility to filter NANs;

- raise an immediate error;

- return a NAN;

- treat them as missing data.


(Currently it is a small sample size, so I don't expect the 
answers will stay evenly split if more people answer.)

On consideration of all the views expressed, thank you to everyone who 
commented, I'm now inclined to default to returning a NAN (which happens 
to be the current behaviour of mean etc, but not median except by 
accident) even if it impacts performance.




-- 
Steve


Re: [Python-ideas] NAN handling in the statistics module

2019-01-08 Thread Tim Peters
I'd like to see internal consistency across the central-tendency
statistics in the presence of NaNs.  What happens now:

mean:  the code appears to guarantee that a NaN will be returned if a
NaN is in the input.

median:  as recently detailed, just about anything can happen,
depending on how undefined behaviors in .sort() interact.

mode:  while NaN != NaN at the Python level, internally dicts use an
identity shortcut so that, effectively, "is" takes precedence over
`__eq__`.  So a given NaN object will be recognized as repeated if it
appears more than once, but distinct NaN objects remain distinct:  So,
e.g.,

>>> from math import inf, nan
>>> import statistics
>>> statistics.mode([2, 2, nan, nan, nan])
nan

That's NOT "NaN-in, NaN-out", it's "a single NaN object is the object
that appeared most often".  Make those 3 distinct NaN objects (inf -
inf results) instead, and the mode changes:

>>> statistics.mode([2, 2, inf - inf, inf - inf, inf - inf])
2
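
The identity shortcut is easy to observe directly with containment tests,
which (like dict lookup) check identity before equality; a minimal
demonstration:

```python
from math import inf, nan

print(nan == nan)      # False: IEEE-754 equality
print(nan in [nan])    # True: containment checks "is" before "=="

other = inf - inf      # a distinct NaN object
print(other in [nan])  # False: different object, and NaN != NaN
```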

Since the current behavior of `mean()` is the only one that's sane,
that should probably become  the default for all of them (NaN in ->
NaN out).

"NaN in -> exception" and "pretend NaNs in the input don't exist" are
the other possibly useful behaviors.

About median speed, I wouldn't worry.  Long ago I tried many
variations of QuickSelect, and it required very large inputs for a
Python-coded QuickSelect to run faster than a straightforward
.sort()+index.  It's bound to be worse now:

- Current Python .sort() is significantly faster on one-type lists
because it figures out the single type-specific comparison routine
needed once at the start, instead of enduring N log N full-blown
PyObject_RichCompareBool calls.

- And the current .sort() can be very much faster than older ones on
data with significant order.  In the limit, .sort()+index will run
faster than any QuickSelect variant on already-sorted or
already-reverse-sorted data.  QuickSelect variants aren't adaptive in
any sense, except that a "fat pivot" version (3-way partition, into <
pivot, == pivot, and > pivot regions) is very effective on data with
many equal values.
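
For concreteness, a minimal pure-Python fat-pivot QuickSelect of the kind
described above (an illustrative sketch, not the statistics module's code):

```python
import random

def quickselect(data, k):
    """Return the k-th smallest element (0-based) of data.

    Uses a 3-way ("fat pivot") partition into values below, equal to,
    and above a randomly chosen pivot, so it handles many equal values
    well.  Average O(n), but it assumes the data defines a total
    order -- NaNs in the input will break it.
    """
    data = list(data)
    while True:
        pivot = random.choice(data)
        below = [x for x in data if x < pivot]
        equal = [x for x in data if x == pivot]
        above = [x for x in data if x > pivot]
        if k < len(below):
            data = below                 # answer lies in the low partition
        elif k < len(below) + len(equal):
            return pivot                 # answer is the pivot itself
        else:
            k -= len(below) + len(equal) # recurse into the high partition
            data = above
```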

In Python 3.7.2, for randomly ordered random-ish floats I find that
median() is significantly faster than mean() even on lists with
millions of elements, despite that the former sorts and the latter
doesn't.


Re: [Python-ideas] NAN handling in the statistics module

2019-01-08 Thread Steven D'Aprano
On Tue, Jan 08, 2019 at 04:25:17PM +0900, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
> 
>  > By definition, data containing Not A Number values isn't numeric :-)
> 
> Unfortunately, that's just a joke, because in fact numeric functions
> produce NaNs.

I'm not sure if you're agreeing with me or disagreeing, so I'll assume 
you're agreeing and move on :-)


> I agree that this can easily be resolved by documenting that it is the
> caller's responsibility to remove NaNs from numeric data, but I prefer
> your proposed flags.
>
>  > The only reason why I don't call it a bug is that median() makes no 
>  > promises about NANs at all, any more than it makes promises about the 
>  > median of a list of sets or any other values which don't define a total 
>  > order.
> 
> Pedantically, I would prefer that the promise that ordinal data
> (vs. specifically numerical) has a median be made explicit, as there
> are many cases where statistical data is ordinal.

I think that is reasonable.

Provided the data defines a total order, the median is well-defined when 
there are an odd number of data points, or you can use median_low and 
median_high regardless of the number of data points.
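
A quick illustration of the difference: with an even number of points,
median() interpolates while the low/high variants always return an actual
data point.

```python
import statistics

data = [1, 2, 3, 4]
print(statistics.median(data))       # 2.5 -- interpolated, not in the data
print(statistics.median_low(data))   # 2 -- lower of the two middle values
print(statistics.median_high(data))  # 3 -- higher of the two middle values
```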


> This may be a moot
> point, as in most cases ordinal data is represented numerically in
> computation (Likert scales, for example, are rarely coded as "hate,
> "dislike", "indifferent", "like", "love", but instead as 1, 2, 3, 4,
> 5), and from the point of view of UI presentation, IntEnums do the
> right thing here (print as identifiers, sort as integers).
> 
> Perhaps a better way to document this would be to suggest that ordinal
> data be represented using IntEnums?  (Again to be pedantic, one might
> want OrderedEnums that can be compared but don't allow other
> arithmetic operations.)

That's a nice solution.
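
A sketch of the IntEnum approach for Likert-style ordinal data (the Likert
class here is hypothetical):

```python
from enum import IntEnum
import statistics

class Likert(IntEnum):
    HATE = 1
    DISLIKE = 2
    INDIFFERENT = 3
    LIKE = 4
    LOVE = 5

responses = [Likert.DISLIKE, Likert.LIKE, Likert.LOVE,
             Likert.LIKE, Likert.INDIFFERENT]

# With an odd number of data points, median() returns an actual
# response, so the result is still a Likert member:
m = statistics.median(responses)
print(m.name, int(m))  # LIKE 4
```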





-- 
Steve (the other one)


Re: [Python-ideas] NAN handling in the statistics module

2019-01-07 Thread Steven D'Aprano
On Mon, Jan 07, 2019 at 07:35:45PM +, MRAB wrote:

> Could the functions optionally accept a callback that will be called 
> when a NaN is first seen?
> 
> If the callback returns False, NaNs are suppressed, otherwise they are 
> retained and the function returns NaN (or whatever).

That's an interesting API which I shall have to think about.

> The callback would give the user a chance to raise a warning or an 
> exception, if desired.

One practical annoyance of this API is that you cannot include raise 
from a lambda, so people desiring "fail fast" semantics can't do this:

result = mean(data, callback=lambda: raise Exception)

They have to pre-declare the callback using def.



-- 
Steve


Re: [Python-ideas] NAN handling in the statistics module

2019-01-07 Thread David Mertz
This callback idea feels way over-engineered for this module. It would
absolutely make sense in a more specialized numeric or statistical library.
But `statistics` feels to me like it should be only simple and basic
operations, with very few knobs attached.

On Mon, Jan 7, 2019, 2:36 PM MRAB wrote:

> On 2019-01-07 16:34, Steven D'Aprano wrote:
> > On Mon, Jan 07, 2019 at 10:05:19AM -0500, David Mertz wrote:
> [snip]
> >>  It's not hard to manually check for NaNs and
> >> generate those in your own code.
> >
> > That is correct, but by that logic, we don't need to support *any* form
> > of NAN handling at all. It is easy (if inefficent) for the caller to
> > pre-filter their data. I want to make it easier and more convenient and
> > avoid having to iterate over the data twice if it isn't necessary.
> >
> Could the functions optionally accept a callback that will be called
> when a NaN is first seen?
>
> If the callback returns False, NaNs are suppressed, otherwise they are
> retained and the function returns NaN (or whatever).
>
> The callback would give the user a chance to raise a warning or an
> exception, if desired.


Re: [Python-ideas] NAN handling in the statistics module

2019-01-07 Thread MRAB

On 2019-01-07 16:34, Steven D'Aprano wrote:

On Mon, Jan 07, 2019 at 10:05:19AM -0500, David Mertz wrote:

[snip]

 It's not hard to manually check for NaNs and
generate those in your own code.


That is correct, but by that logic, we don't need to support *any* form
of NAN handling at all. It is easy (if inefficient) for the caller to
pre-filter their data. I want to make it easier and more convenient and
avoid having to iterate over the data twice if it isn't necessary.

Could the functions optionally accept a callback that will be called 
when a NaN is first seen?


If the callback returns False, NaNs are suppressed, otherwise they are 
retained and the function returns NaN (or whatever).


The callback would give the user a chance to raise a warning or an 
exception, if desired.



Re: [Python-ideas] NAN handling in the statistics module

2019-01-07 Thread David Mertz
On Mon, Jan 7, 2019 at 12:19 PM David Mertz  wrote:

> Under a partial ordering, a median may not be unique.  Even under a total
> ordering this is true if some subset of elements form an equivalence
> class.  But under partial ordering, the non-uniqueness can get much weirder.
>

I'm sure with more thought, weirder things can be thought of.  But just as
a quick example, it would be easy to write classes such that:

a < b < c < a

In such a case (or expand for an odd number of distinct things), it would
be reasonable to call ANY element of [a, b, c] a median. That's funny, but
it is not imprecise.



Re: [Python-ideas] NAN handling in the statistics module

2019-01-07 Thread David Mertz
On Mon, Jan 7, 2019, 11:38 AM Steven D'Aprano wrote:

> It's not a bug in median(), because median requires the data implement a
> total order. Although that isn't explicitly documented, it is common sense:
> if the data cannot be sorted into smallest-to-largest order, how can you
> decide which value is in the middle?
>

I can see no reason that median per se requires a total order.  Yes, the
implementation chosen (and many reasonable and obvious implementations)
make that assumption.  But here is a perfectly reasonable definition of
median:

* A median is an element of a collection such that 1/2 of all elements of
the collection are less than it.

Depending on how you interpret median, this element might also not be in
the original collection, but be some newly generated value that has that
property.  E.g. statistics.median([1,2,3,4]) == 2.5.

Under a partial ordering, a median may not be unique.  Even under a total
ordering this is true if some subset of elements form an equivalence
class.  But under partial ordering, the non-uniqueness can get much weirder.

What is explicitly documented is that median requires numeric data, and
> NANs aren't numbers. So the only bug here is the caller's failure to
> filter out NANs. If you pass it garbage data, you get garbage results.
>

OK, then we should either raise an exception or propagate the NaN if that
is the intended meaning of the function.  And obviously document that such
is the assumption.  NaN's *are* explicitly in the floating-point domain, so
it's fuzzy whether they are numeric or not, notwithstanding the name.

I'm very happy to push NaN-filtering to users (as NumPy does, although it
provides alternate functions for many reductions that incorporate this...
the basic ones always propagate NaNs though).


> Nevertheless, it is a perfectly reasonable thing to want to use data
> which may or may not contain NANs, and I want to enhance the statistics
> module to make it easier for the caller to handle NANs in whichever way
> they see fit. This is a new feature, not a bug fix.
>

I disagree about bug vs. feature.  The old behavior is simply and
unambiguously wrong, but was not previously noticed.  Obviously, the bug
does not affect most uses, which is why it was not noticed.


> If you truly believe that, then you should also believe that both
> list.sort() and the bisect module are buggy, for precisely the same
> reason.
>

I cannot perceive any close connection between the correct behavior of
statistics.mean() and that of list.sort() or bisect.  I know the concrete
implementation of the former uses the latter, but the answers for what is
RIGHT feel completely independent to me.

I doubt Quickselect will be immune to the problem of NANs. It too relies
> on comparisons, and while I don't know for sure that it requires a total
> order, I'd be surprised if it doesn't. Quickselect is basically a
> variant of Quicksort that only partially sorts the data.
>

Yes, I was thinking of trying to tweak Quickselect to handle NaNs during
the process.  I.e. probably terminate and propagate the NaN early, as soon
as one is encountered.  That might save much of the work if a NaN is
encountered early and most comparisons and moves can be avoided.  Of
course, I'm sure there is a worst case where almost all the work is done
before a NaN check is performed in some constructed example.


Re: [Python-ideas] NAN handling in the statistics module

2019-01-07 Thread Guido van Rossum
On Mon, Jan 7, 2019 at 8:39 AM Steven D'Aprano  wrote:

> Its not a bug in median(), because median requires the data implement a
> total order. Although that isn't explicitly documented, it is common
> sense: if the data cannot be sorted into smallest-to-largest order, how
> can you decide which value is in the middle?
>
> What is explicitly documented is that median requires numeric data, and
> NANs aren't numbers. So the only bug here is the caller's failure to
> filter out NANs. If you pass it garbage data, you get garbage results.
>
> Nevertheless, it is a perfectly reasonable thing to want to use data
> which may or may not contain NANs, and I want to enhance the statistics
> module to make it easier for the caller to handle NANs in whichever way
> they see fit. This is a new feature, not a bug fix.
>

So then you are arguing that making reasonable treatment of NANs the
default is not breaking backwards compatibility (because previously the
data was considered wrong). This sounds like a good idea to me. Presumably
the NANs are inserted into the data explicitly in order to signal missing
data -- this seems more plausible to me (given the typical use case for the
statistics module) than that they would be the result of a computation like
Inf/Inf. (While propagating NANs makes sense for the fundamental
arithmetical and mathematical functions, given that we have chosen not to
raise an error when encountering them, I think other stdlib libraries are
not beholden to that behavior.)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-ideas] NAN handling in the statistics module

2019-01-07 Thread Steven D'Aprano
On Mon, Jan 07, 2019 at 10:05:19AM -0500, David Mertz wrote:
> On Mon, Jan 7, 2019 at 6:50 AM Steven D'Aprano  wrote:
> 
> > > I'll provide a suggested patch on the bug.  It will simply be a wholly
> > > different implementation of median and friends.
> >
> > I ask for a documentation patch and you start talking about a whole new
> > implementation. Huh.
> > A new implementation with precisely the same behaviour is a waste of
> > time, so I presume you're planning to change the behaviour. How about if
> > you start off by explaining what the new semantics are?
> >
> 
> I think it would be counter-productive to document the bug (as something
> other than a bug).

It's not a bug in median(), because median requires the data implement a 
total order. Although that isn't explicitly documented, it is common 
sense: if the data cannot be sorted into smallest-to-largest order, how 
can you decide which value is in the middle?

What is explicitly documented is that median requires numeric data, and 
NANs aren't numbers. So the only bug here is the caller's failure to 
filter out NANs. If you pass it garbage data, you get garbage results.

Nevertheless, it is a perfectly reasonable thing to want to use data 
which may or may not contain NANs, and I want to enhance the statistics 
module to make it easier for the caller to handle NANs in whichever way 
they see fit. This is a new feature, not a bug fix.


> Picking what is a completely arbitrary element in face
> of a non-total order can never be "correct" behavior, and is never worth
> preserving for compatibility.

If you truly believe that, then you should also believe that both 
list.sort() and the bisect module are buggy, for precisely the same 
reason.

Perhaps you ought to raise a couple of bug reports, and see if you can 
get Tim and Raymond to agree that sorting and bisect should do something 
other than what they already do in the face of data that doesn't define 
a total order.


> I think the use of statistics.median against
> partially ordered elements is simply rare enough that no one tripped
> against it, or at least no one reported it before.

I'm sure it is rare. Nevertheless, I still want to make it easier for 
people to deal with this case.


> Notice that the code itself pretty much recognizes the bug in this comment:
> 
> # FIXME: investigate ways to calculate medians without sorting? Quickselect?

I doubt Quickselect will be immune to the problem of NANs. It too relies 
on comparisons, and while I don't know for sure that it requires a total 
order, I'd be surprised if it doesn't. Quickselect is basically a 
variant of Quicksort that only partially sorts the data.


> So it seems like the original author knew the implementation was wrong.

That's not why I put that comment in. Sorting is O(N log N) on average, 
and Quickselect can be O(N) on average. In principle, Quickselect or a 
similar selection algorithm could be faster than sorting.


[...]
>  It's not hard to manually check for NaNs and
> generate those in your own code.

That is correct, but by that logic, we don't need to support *any* form 
of NAN handling at all. It is easy (if inefficient) for the caller to 
pre-filter their data. I want to make it easier and more convenient and 
avoid having to iterate over the data twice if it isn't necessary.



-- 
Steve


Re: [Python-ideas] NAN handling in the statistics module

2019-01-07 Thread David Mertz
On Mon, Jan 7, 2019 at 6:50 AM Steven D'Aprano  wrote:

> > I'll provide a suggested patch on the bug.  It will simply be a wholly
> > different implementation of median and friends.
>
> I ask for a documentation patch and you start talking about a whole new
> implementation. Huh.
> A new implementation with precisely the same behaviour is a waste of
> time, so I presume you're planning to change the behaviour. How about if
> you start off by explaining what the new semantics are?
>

I think it would be counter-productive to document the bug (as something
other than a bug).  Picking what is a completely arbitrary element in face
of a non-total order can never be "correct" behavior, and is never worth
preserving for compatibility.  I think the use of statistics.median against
partially ordered elements is simply rare enough that no one tripped
against it, or at least no one reported it before.

Notice that the code itself pretty much recognizes the bug in this comment:

# FIXME: investigate ways to calculate medians without sorting? Quickselect?


So it seems like the original author knew the implementation was wrong.
But you're right, the new behavior needs to be decided.  Propagating NaNs
is reasonable.  Filtering out NaN's is reasonable.  Those are the default
behaviors of NumPy and Pandas, respectively:

np.median([1,2,3,nan]) # -> nan
pd.Series([1,2,3,nan]).median() # -> 2.0

(Yes, of course there are ways in each to get the other behavior).  Other
non-Python tools similarly suggest one of those behaviors, but really
nothing else.

So yeah, what I was suggesting as a patch was an implementation that had
PROPAGATE and IGNORE semantics.  I don't have a real opinion about which
should be the default, but the current behavior should simply not exist at
all.  As I think about it, warnings and exceptions are really too complex
an API for this module.  It's not hard to manually check for NaNs and
generate those in your own code.
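
For the record, a sketch of what PROPAGATE/IGNORE (plus fail-fast) semantics
could look like as a thin wrapper around the current median; the nan_policy
parameter is hypothetical, not an existing API:

```python
import math
import statistics

def median(data, nan_policy="propagate"):
    """median() with explicit NaN handling (hypothetical API).

    nan_policy: "propagate" -> return nan if any NaN is present,
                "ignore"    -> drop NaNs before computing,
                "raise"     -> fail fast with ValueError.
    """
    data = list(data)
    def _isnan(x):
        return isinstance(x, float) and math.isnan(x)
    if any(_isnan(x) for x in data):
        if nan_policy == "propagate":
            return math.nan
        if nan_policy == "raise":
            raise ValueError("NaN in data")
        if nan_policy == "ignore":
            data = [x for x in data if not _isnan(x)]
    return statistics.median(data)
```

(With "ignore", an all-NaN input leaves an empty list and statistics.median
raises StatisticsError, which seems like the right failure mode anyway.)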



Re: [Python-ideas] NAN handling in the statistics module

2019-01-07 Thread Steven D'Aprano
On Mon, Jan 07, 2019 at 02:01:34PM +, Jonathan Fine wrote:

> Finally, I suggest that we might learn from
> ==
> Fix some special cases in Fractions?
> https://mail.python.org/pipermail/python-ideas/2018-August/053083.html
> ==

I remember that thread from August, and I've just re-read the entire 
thing now, and I don't see the relevance. Can you explain why you think 
it is relevant to this thread?



-- 
Steve


Re: [Python-ideas] NAN handling in the statistics module

2019-01-07 Thread Jonathan Fine
Happy New Year (off topic).

Based on a quick review of the python docs, the bug report, PEP 450
and this thread, I suggest

1. More carefully draw attention to the NaN feature, in the
documentation for existing Python versions.
2. Consider revising statistics.py so that it raises an exception,
when passed NaN data.

https://www.python.org/dev/peps/pep-0450/#rationale says

The proposed statistics module is motivated by the "batteries
included" philosophy towards the Python standard library. Raymond
Hettinger and other senior developers have requested a quality
statistics library that falls somewhere in between high-end statistics
libraries and ad hoc code. Statistical functions such as mean,
standard deviation and others are obvious and useful batteries,
familiar to any Secondary School student.


The PEP makes no mention of NaN. Was it in error, in not stating whether
NaN data is admissible? Is NaN part of the "batteries familiar to any
Secondary School student"?

https://docs.python.org/3/library/statistics.html says

This module provides functions for calculating mathematical statistics
of numeric (Real-valued) data.


Some people regard NaN as not being a real-valued number. (Hint:
There's a clue in the name: Not A Number.)

Note that statistics.py already raises StatisticsError, when it
regards the data as flawed.
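For example, empty data is already treated as flawed and raises
StatisticsError, whereas NaNs currently are not (the exact error message
may vary between versions):

```python
import statistics

# Empty input is rejected up front; NaN input currently is not.
try:
    statistics.mean([])
except statistics.StatisticsError:
    print("refused empty data")
```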

Finally, I suggest that we might learn from
==
Fix some special cases in Fractions?
https://mail.python.org/pipermail/python-ideas/2018-August/053083.html
==

I'll put a brief summary of my message into the bug tracker for this issue.

-- 
Jonathan


Re: [Python-ideas] NAN handling in the statistics module

2019-01-07 Thread Steven D'Aprano
On Mon, Jan 07, 2019 at 01:34:47AM -0500, David Mertz wrote:

> > I'm not opposed to documenting this better. Patches welcome :-)
> >
> 
> I'll provide a suggested patch on the bug.  It will simply be a wholly
> different implementation of median and friends.

I ask for a documentation patch and you start talking about a whole new 
implementation. Huh.

A new implementation with precisely the same behaviour is a waste of 
time, so I presume you're planning to change the behaviour. How about if 
you start off by explaining what the new semantics are?



-- 
Steve


Re: [Python-ideas] NAN handling in the statistics module

2019-01-07 Thread Antoine Pitrou
On Sun, 6 Jan 2019 19:40:32 -0800
Stephan Hoyer  wrote:
> On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano  wrote:
> 
> > I propose adding a "nan_policy" keyword-only parameter to the relevant
> > statistics functions (mean, median, variance etc), and defining the
> > following policies:
> >
> > IGNORE:  quietly ignore all NANs
> > FAIL:  raise an exception if any NAN is seen in the data
> > PASS:  pass NANs through unchanged (the default)
> > RETURN:  return a NAN if any NAN is seen in the data
> > WARN:  ignore all NANs but raise a warning if one is seen
> >  
> 
> I don't think PASS should be the default behavior, and I'm not sure it
> would be productive to actually implement all of these options.
> 
> For reference, NumPy and pandas (the two most popular packages for data
> analytics in Python) support two of these modes:
> - RETURN (numpy.mean() and skipna=False for pandas)
> - IGNORE (numpy.nanmean() and skipna=True for pandas)
> 
> RETURN is the default behavior for NumPy; IGNORE is the default for pandas.

I agree with Stephan that RETURN and IGNORE are the only useful modes
of operation here.

Regards

Antoine.




Re: [Python-ideas] NAN handling in the statistics module

2019-01-07 Thread Steven D'Aprano
(By the way, I'm not outright disagreeing with you, I'm trying to weigh 
up the pros and cons of your position. You've given me a lot to think 
about. More below.)

On Sun, Jan 06, 2019 at 11:31:30PM -0800, Nathaniel Smith wrote:
> On Sun, Jan 6, 2019 at 11:06 PM Steven D'Aprano  wrote:
> > I'm not wedded to the idea that the default ought to be the current
> > behaviour. If there is a strong argument for one of the others, I'm
> > listening.
> 
> "Errors should never pass silently"? Silently returning nonsensical
> results is hard to defend as a default behavior IMO :-)

If you violate the assumptions of the function, just about everything 
can in principle return nonsensical results. True, most of the time you 
have to work hard at it:

class MyList(list):
def __len__(self):
return random.randint(0, sys.maxint)

but it isn't unreasonable to document the assumptions of a function, and 
if the caller violates those assumptions, Garbage In Garbage Out 
applies.

E.g. bisect requires that your list is sorted in ascending order. If it 
isn't, the results you get are nonsensical.

py> data = [8, 6, 4, 2, 0]
py> bisect.bisect(data, 1)
0

That's not a bug in bisect, that's a bug in the caller's code, and it 
isn't bisect's responsibility to fix it.

Although it could be documented better, that's the current situation 
with NANs and median(). Data with NANs don't have a total ordering, and 
total ordering is the unstated assumption behind the idea of a median or 
middle value. So all bets are off.
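The lack of a total order is easy to demonstrate: every ordering
comparison involving a NaN is false, so a NaN is unordered with respect
to every value, including itself:

```python
import math

nan = math.nan

# All three comparisons are False: nan fits nowhere in an ordering.
print(nan < 5)    # False
print(5 < nan)    # False
print(nan < nan)  # False
```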

 
> > How would you answer those who say that the right behaviour is not to
> > propagate unwanted NANs, but to fail fast and raise an exception?
> 
> Both seem defensible a priori, but every other mathematical operation
> in Python propagates NaNs instead of raising an exception. Is there
> something unusual about median that would justify giving it unusual
> behavior?

Well, not everything... 

py> NAN/0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: float division by zero


There may be others. But I'm not sure that "everything else does it" is 
a strong justification. It is *a* justification, since consistency is 
good, but consistency does not necessarily outweigh other concerns.

One possible argument for making PASS the default, even if that means
implementation-dependent behaviour with NANs, is that in the absence of
a clear preference for FAIL or RETURN, at least PASS is backwards
compatible.

You might shoot yourself in the foot, but at least you know it's the same
foot you shot yourself in using the previous version *wink*



-- 
Steve


Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread Nathaniel Smith
On Sun, Jan 6, 2019 at 11:06 PM Steven D'Aprano  wrote:
> I'm not wedded to the idea that the default ought to be the current
> behaviour. If there is a strong argument for one of the others, I'm
> listening.

"Errors should never pass silently"? Silently returning nonsensical
results is hard to defend as a default behavior IMO :-)

> How would you answer those who say that the right behaviour is not to
> propagate unwanted NANs, but to fail fast and raise an exception?

Both seem defensible a priori, but every other mathematical operation
in Python propagates NaNs instead of raising an exception. Is there
something unusual about median that would justify giving it unusual
behavior?

-n

-- 
Nathaniel J. Smith -- https://vorpus.org


Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread Steven D'Aprano
On Sun, Jan 06, 2019 at 07:40:32PM -0800, Stephan Hoyer wrote:
> On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano  wrote:
> 
> > I propose adding a "nan_policy" keyword-only parameter to the relevant
> > statistics functions (mean, median, variance etc), and defining the
> > following policies:
> >
> > IGNORE:  quietly ignore all NANs
> > FAIL:  raise an exception if any NAN is seen in the data
> > PASS:  pass NANs through unchanged (the default)
> > RETURN:  return a NAN if any NAN is seen in the data
> > WARN:  ignore all NANs but raise a warning if one is seen
> >
> 
> I don't think PASS should be the default behavior, and I'm not sure it
> would be productive to actually implement all of these options.

I'm not wedded to the idea that the default ought to be the current 
behaviour. If there is a strong argument for one of the others, I'm 
listening.


> For reference, NumPy and pandas (the two most popular packages for data
> analytics in Python) support two of these modes:
> - RETURN (numpy.mean() and skipna=False for pandas)
> - IGNORE (numpy.nanmean() and skipna=True for pandas)
> 
> RETURN is the default behavior for NumPy; IGNORE is the default for pandas.
> 
> I'm pretty sure RETURN is the right default behavior for Python's standard
> library and anything else should be considered a bug. It safely propagates
> NaNs, along the lines of IEEE float behavior.

How would you answer those who say that the right behaviour is not to
propagate unwanted NANs, but to fail fast and raise an exception?


> I'm not sure what the use cases are for PASS, FAIL, or WARN, none of which
> are supported by NumPy or pandas:
> - PASS is a license to return silently incorrect results, in return for
> very marginal performance benefits.

By my (very rough) preliminary testing, the cost of checking for NANs 
doubles the cost of calculating the median, and increases the cost of 
calculating the mean() by 25%.

I'm not trying to compete with statistics libraries written in C for 
speed, but that doesn't mean I don't care about performance at all. The 
statistics library is already slower than I like and I don't want to 
slow it down further for the common case (numeric data with no NANs) for 
the sake of the uncommon case (data with NANs).

But I hear you about the "return silently incorrect results" part.

Fortunately, I think that only applies to sort-based functions like
median(). mean() etc ought to propagate NANs with any reasonable
implementation, but I'm reluctant to make that a guarantee in case I
come up with some unreasonable implementation :-)
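As an observed example of that propagation with CPython's current
implementation (behaviour, not a documented guarantee):

```python
import math
import statistics

# CPython's statistics.mean() sums the data exactly, and a float NaN in
# the data makes the sum (and hence the mean) NaN.  Observed behaviour
# only; the docs make no promise about NaNs.
result = statistics.mean([1.0, 2.0, math.nan])
print(result)  # nan
```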


> This seems at odds with the intended
> focus of the statistics module on correctness over speed. Returning
> incorrect statistics should not be considered a feature that needs to be
> maintained.

It is only incorrect because the data violates the documented 
requirement that it be *numeric data*, and the undocumented requirement 
that the numbers have a total order. (So complex numbers are out.) I 
admit that the docs could be improved, but there are no guarantees made 
about NANs.

This doesn't mean I don't want to improve the situation! Far from it, 
hence this discussion.


> - FAIL would make sense if statistics functions could introduce *new* NaN
> values. But as far as I can tell, statistics functions already raise
> StatisticsError in these cases (e.g., if zero data points are provided). If
> users are concerned about accidentally propagating NaNs, they should be
> encouraged to check for NaNs at the entry points of their code.

As far as I can tell, there are two kinds of people when it comes to 
NANs: those who think that signalling NANs are a waste of time and NANs 
should always propagate, and those who hate NANs and wish that they
would always signal (raise an exception).

I'm not going to get into an argument about who is right or who is 
wrong.


> - WARN is even less useful than FAIL. Seriously, who likes warnings?

Me :-)


> NumPy
> uses this approach for array operations that produce NaNs (e.g., when
> dividing by zero), because *some* but not all results may be valid. But
> statistics functions return scalars.
> 
> I'm not even entirely sure it makes sense to add the IGNORE option, or at
> least to add it only for NaN. None is also a reasonable sentinel for a
> missing value in Python, and user defined types (e.g., pandas.NaT) also
> fall in this category. It seems a little strange to single NaN out in
> particular.

I am considering adding support for a dedicated "missing" value, whether 
it is None or a special sentinel. But one thing at a time. Ignoring NANs 
is moderately common in other statistics libraries, and although I 
personally feel that NANs shouldn't be used for missing values, I know 
many people do so.


-- 
Steve

Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread David Mertz
On Mon, Jan 7, 2019 at 1:27 AM Steven D'Aprano  wrote:

> > In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
> > Out[4]: 1
> > In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
> > Out[5]: nan
>
> The second is possibly correct if one thinks that the median of a list
> containing NAN should return NAN -- but it's only correct by accident,
> not design.
>

Exactly... in the second example, the nan just happens to wind up "in the
middle" of the sorted() list.  The fact that it is the return value has
nothing to do with propagating the nan (if it did, I think it would be a
reasonable answer).  I contrived the examples to get these... the first
answer, which is the "most wrong number", is also selected for the same
reason that a nan is "near the middle."


> I'm not opposed to documenting this better. Patches welcome :-)
>

I'll provide a suggested patch on the bug.  It will simply be a wholly
different implementation of median and friends.


> There are at least three correct behaviours in the face of data
> containing NANs: propagate a NAN result, fail fast with an exception, or
> treat NANs as missing data that can be ignored. Only the caller can
> decide which is the right policy for their data set.


I'm not sure that raising right away is necessary as an option.  That feels
like something a user could catch at the end when they get a NaN result.
But those seem reasonable as three options.




Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread Steven D'Aprano
On Sun, Jan 06, 2019 at 10:52:47PM -0500, David Mertz wrote:

> Playing with Tim's examples, this suggests that statistics.median() is
> simply outright WRONG.  I can think of absolutely no way to characterize
> these as reasonable results:
> 
> Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)
> In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
> Out[4]: 1
> In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
> Out[5]: nan

The second is possibly correct if one thinks that the median of a list
containing NAN should return NAN -- but it's only correct by accident,
not design.

As I wrote on the bug tracker:

"I agree that the current implementation-dependent behaviour when there 
are NANs in the data is troublesome."

The only reason why I don't call it a bug is that median() makes no 
promises about NANs at all, any more than it makes promises about the 
median of a list of sets or any other values which don't define a total 
order. help(median) says:

Return the median (middle value) of numeric data.


By definition, data containing Not A Number values isn't numeric :-)

I'm not opposed to documenting this better. Patches welcome :-)

There are at least three correct behaviours in the face of data 
containing NANs: propagate a NAN result, fail fast with an exception, or
treat NANs as missing data that can be ignored. Only the caller can 
decide which is the right policy for their data set.

Aside: the IEEE-754 standard provides both signalling and quiet NANs. It 
is hard and unreliable to generate signalling float NANs in Python, but 
we can do it with Decimal:

py> from statistics import median
py> from decimal import Decimal
py> median([1, 3, 4, Decimal("sNAN"), 2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/statistics.py", line 349, in median
    data = sorted(data)
decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>]


In principle, one ought to be able to construct float signalling NANs 
too, but unfortunately that's platform dependent:

https://mail.python.org/pipermail/python-dev/2018-November/155713.html

Back to the topic at hand: I agree that median() does "the wrong thing"
when NANs are involved, but there is no one "right thing" that we can do
in its place. People disagree as to whether NANs should propagate, or
raise, or be treated as missing data, and I see good arguments for all
three.


-- 
Steve


Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread Tim Peters
[David Mertz ]
> OK, let me be more precise.  Obviously if the implementation in a class is:
>
> class Foo:
> def __lt__(self, other):
> return random.random() < 0.5
>
> Then we aren't going to rely on much.
>
> * If comparison of any two items in a list (under __lt__) is deterministic, is
> the resulting sort order deterministic? (Pretty sure this is a yes)

Yes, but not defined unless __lt__ also defines a total ordering.

> * If the pairwise comparisons are deterministic, is sorting idempotent?

Not necessarily.  For example, the 2-element list here swaps its
elements every time `.sort()` is
invoked, because the second element always claims it's "less than" the
first element, regardless of which order they're in:

class RelentlesslyTiny:
def __init__(self, name):
self.name = name
def __repr__(self):
return self.name
def __lt__(self, other):
return self is not other

a = RelentlesslyTiny("A")
b = RelentlesslyTiny("B")
xs = [a, b]
print(xs)
xs.sort()
print("after sorting once", xs)
xs.sort()
print("after sorting twice", xs)

[A, B]
after sorting once [B, A]
after sorting twice [A, B]

> This statement is certainly false:
>
> * If two items are equal, and pairwise inequality is deterministic, exchanging
> the items does not affect the sorting of other items in the list.

What I said at the start ;-)  The only thing .sort() always guarantees
regardless of how goofy __lt__ may be is that the result list will be
some permutation of the input list.  This is so even if __lt__ raises
an uncaught exception, killing the sort mid-stream.


Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread David Mertz
> This statement is certainly false:
>
> * If two items are equal, and pairwise inequality is deterministic,
> exchanging the items does not affect the sorting of other items in the list.

Just to demonstrate this obviousness:

>>> sorted([9, 9, 9, b, 1, 2, 3, a])
[1, 2, 3, A, B, 9, 9, 9]
>>> sorted([9, 9, 9, a, 1, 2, 3, b])
[B, 9, 9, 9, A, 1, 2, 3]
>>> a == b
True


The classes involved are:

class A:
def __lt__(self, other):
return False
__gt__ = __lt__
def __eq__(self, other):
return True
def __repr__(self):
return self.__class__.__name__

class B(A):
def __lt__(self, other):
return True
__gt__ = __lt__


I do not think these are useful, but __lt__ is deterministic here.



Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread Chris Angelico
On Mon, Jan 7, 2019 at 3:19 PM David Mertz  wrote:
>
> OK, let me be more precise.  Obviously if the implementation in a class is:
>
> class Foo:
> def __lt__(self, other):
> return random.random() < 0.5
>
>
> Then we aren't going to rely on much.
>
> * If comparison of any two items in a list (under __lt__) is deterministic, 
> is the resulting sort order deterministic? (Pretty sure this is a yes)

If you guarantee that exactly one of "x < y" and "y < x" is true for
any given pair of values from the list, and further guarantee that if
x < y and y < z then x < z, you have a total order. Without those two
guarantees, you could have deterministic comparisons (eg "nan < 5" is
always false, but so is "5 < nan"), but there's no way to truly put
the elements "in order". Defining __lt__ as "rock < paper", "paper <
scissors", "scissors < rock" means that you can't guarantee the sort
order, nor determinism.
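A minimal sketch of that rock/paper/scissors ordering (illustrative
class, not from the thread's code):

```python
class RPS:
    # Deterministic but cyclic: rock < paper < scissors < rock.
    # No total order exists, so sort() promises only a permutation.
    BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

    def __init__(self, name):
        self.name = name

    def __lt__(self, other):
        return self.BEATS[self.name] == other.name

    def __repr__(self):
        return self.name

items = [RPS("scissors"), RPS("rock"), RPS("paper")]
result = sorted(items)  # some permutation; the order itself is undefined
```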

Are those guarantees safe for your purposes? If so, sort() is, AIUI,
guaranteed to behave sanely.

ChrisA


Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread David Mertz
OK, let me be more precise.  Obviously if the implementation in a class is:

class Foo:
def __lt__(self, other):
return random.random() < 0.5


Then we aren't going to rely on much.

* If comparison of any two items in a list (under __lt__) is deterministic,
is the resulting sort order deterministic? (Pretty sure this is a yes)
* If the pairwise comparisons are deterministic, is sorting idempotent?

This statement is certainly false:

* If two items are equal, and pairwise inequality is deterministic,
exchanging the items does not affect the sorting of other items in the list.

On Sun, Jan 6, 2019 at 11:09 PM Tim Peters  wrote:

> [David Mertz ]
> > Thanks Tim for clarifying.  Is it even the case that sorts are STABLE in
> > the face of non-total orderings under __lt__?  A couple quick examples
> > don't refute that, but what I tried was not very thorough, nor did I
> > think much about TimSort itself.
>
> I'm not clear on what "stable" could mean in the absence of a total
> ordering.  Not only does sort not assume __lt__ is a total ordering,
> it doesn't assume it's transitive, or even deterministic.  We really
> can't assume anything about potentially user-defined functions.
>
> What sort does guarantee is that the result list is some permutation
> of the input list, regardless of how insanely __lt__ may behave.  If
> __lt__ sanely defines a deterministic total order, then "stable" and
> "sorted" are guaranteed too, with their obvious meanings.
>




Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread Tim Peters
[David Mertz ]
> Thanks Tim for clarifying.  Is it even the case that sorts are STABLE in
> the face of non-total orderings under __lt__?  A couple quick examples
> don't refute that, but what I tried was not very thorough, nor did I
> think much about TimSort itself.

I'm not clear on what "stable" could mean in the absence of a total
ordering.  Not only does sort not assume __lt__ is a total ordering,
it doesn't assume it's transitive, or even deterministic.  We really
can't assume anything about potentially user-defined functions.

What sort does guarantee is that the result list is some permutation
of the input list, regardless of how insanely __lt__ may behave.  If
__lt__ sanely defines a deterministic total order, then "stable" and
"sorted" are guaranteed too, with their obvious meanings.


Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread David Mertz
[... apologies if this is dup, got a bounce ...]

> [David Mertz ]
>> I have to say though that the existing behavior of
>> `statistics.median[_low|_high|]` is SURPRISING if not outright wrong.
>> It is the behavior in existing Python, but it is very strange.
>>
>> The implementation simply does whatever `sorted()` does, which is an
>> implementation detail.  In particular, NaN's being neither less than nor
>> greater than any floating point number, just stay where they are during
>> sorting.
>
> I expect you inferred that from staring at a handful of examples, but
> it's illusion.  Python's sort uses only __lt__ comparisons, and if
> those don't implement a total ordering then _nothing_ is defined about
> sort's result (beyond that it's some permutation of the original
> list).

Thanks Tim for clarifying.  Is it even the case that sorts are STABLE in
the face of non-total orderings under __lt__?  A couple quick examples
don't refute that, but what I tried was not very thorough, nor did I
think much about TimSort itself.

> So, certainly, if you want median to be predictable in the presence of
> NaNs, sort's behavior in the presence of NaNs can't be relied on in
> any respect.

Playing with Tim's examples, this suggests that statistics.median() is
simply outright WRONG.  I can think of absolutely no way to characterize
these as reasonable results:

Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)
In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
Out[4]: 1
In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
Out[5]: nan


Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread Tim Peters
[David Mertz ]
> I have to say though that the existing behavior of 
> `statistics.median[_low|_high|]`
> is SURPRISING if not outright wrong.  It is the behavior in existing Python,
> but it is very strange.
>
> The implementation simply does whatever `sorted()` does, which is an
> implementation detail.  In particular, NaN's being neither less than nor
> greater than any floating point number, just stay where they are during
> sorting.

I expect you inferred that from staring at a handful of examples, but
it's illusion.  Python's sort uses only __lt__ comparisons, and if
those don't implement a total ordering then _nothing_ is defined about
sort's result (beyond that it's some permutation of the original
list).

There's nothing special about NaNs in this.  For example, if you sort
a list of sets, then "<" means subset inclusion, which doesn't define
a total ordering among sets in general either (unless for every pair
of sets in a specific list, one is a proper subset of the other - in
which case the list of sets will be sorted in order of increasing
cardinality).
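A quick sketch of both cases Tim describes:

```python
# A chain under proper subset: "<" is a total order on these elements,
# and sorting yields increasing cardinality.
chain = [{1, 2, 3}, {1}, {1, 2}]
print(sorted(chain))  # [{1}, {1, 2}, {1, 2, 3}]

# No chain: {3} is incomparable to the others, so "<" is not a total
# order and the result is merely *some* permutation of the input.
mixed = [{1, 2}, {3}, {1}]
out = sorted(mixed)
```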

> But that's a particular feature of TimSort.  Yes, we are guaranteed that sorts
> are stable; and we have rules about which things can and cannot be compared
> for inequality at all.  But beyond that, I do not think Python ever promised 
> that
> NaNs would remain in the same positions after sorting

We don't promise it, and it's not true.  For example,

>>> import math
>>> nan = math.nan
>>> xs = [0, 1, 2, 4, nan, 5, 3]
>>> sorted(xs)
[0, 1, 2, 3, 4, nan, 5]

The NaN happened to move "one place to the right" there.  There's no
point to analyzing "why" - it's purely an accident deriving from the
pattern of __lt__ outcomes the internals happened to invoke.  FYI, it
goes like so:

is 1 < 0?  No, so the first two are already sorted.
is 2 < 1?  No, so the first three are already sorted.
is 4 < 2?  No, so the first four are already sorted
is nan < 4?  No, so the first five are already sorted
is 5 < nan?  No, so the first six are already sorted
is 3 < 5?  Yes!

At that point a binary insertion is used to move 3 into place.

And none of timsort's "fancy" parts even come into play for lists so
small.  The patterns of comparisons the fancy parts invoke can be much
more involved.

At no point does the algorithm have any idea that there are NaNs in
the list - it only looks at boolean __lt__ outcomes.

So, certainly, if you want median to be predictable in the presence of
NaNs, sort's behavior in the presence of NaNs can't be relied on in
any respect.

>>> sorted([6, 5, nan, 4, 3, 2, 1])
[1, 2, 3, 4, 5, 6, nan]

>>> sorted([9, 9, 9, 9, 9, 9, nan, 1, 2, 3, 4, 5, 6])
[9, 9, 9, 9, 9, 9, nan, 1, 2, 3, 4, 5, 6]


Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread Stephan Hoyer
On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano  wrote:

> I propose adding a "nan_policy" keyword-only parameter to the relevant
> statistics functions (mean, median, variance etc), and defining the
> following policies:
>
> IGNORE:  quietly ignore all NANs
> FAIL:  raise an exception if any NAN is seen in the data
> PASS:  pass NANs through unchanged (the default)
> RETURN:  return a NAN if any NAN is seen in the data
> WARN:  ignore all NANs but raise a warning if one is seen
>

I don't think PASS should be the default behavior, and I'm not sure it
would be productive to actually implement all of these options.

For reference, NumPy and pandas (the two most popular packages for data
analytics in Python) support two of these modes:
- RETURN (numpy.mean() and skipna=False for pandas)
- IGNORE (numpy.nanmean() and skipna=True for pandas)

RETURN is the default behavior for NumPy; IGNORE is the default for pandas.

I'm pretty sure RETURN is the right default behavior for Python's standard
library and anything else should be considered a bug. It safely propagates
NaNs, along the lines of IEEE float behavior.

I'm not sure what the use cases are for PASS, FAIL, or WARN, none of which
are supported by NumPy or pandas:
- PASS is a license to return silently incorrect results, in return for
very marginal performance benefits. This seems at odds with the intended
focus of the statistics module on correctness over speed. Returning
incorrect statistics should not be considered a feature that needs to be
maintained.
- FAIL would make sense if statistics functions could introduce *new* NaN
values. But as far as I can tell, statistics functions already raise
StatisticsError in these cases (e.g., if zero data points are provided). If
users are concerned about accidentally propagating NaNs, they should be
encouraged to check for NaNs at the entry points of their code.
- WARN is even less useful than FAIL. Seriously, who likes warnings? NumPy
uses this approach in array operations that produce NaNs (e.g., when
dividing by zero), because *some* but not all results may be valid. But
statistics functions return scalars.

I'm not even entirely sure it makes sense to add the IGNORE option, or at
least to add it only for NaN. None is also a reasonable sentinel for a
missing value in Python, and user defined types (e.g., pandas.NaT) also
fall in this category. It seems a little strange to single NaN out in
particular.
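A caller-side filter along these lines might look like the sketch below, treating both NaN and None as missing (`drop_missing` is a hypothetical helper for illustration, not an existing statistics API):

```python
import math
from statistics import median

def drop_missing(data):
    """Drop NaN and None before computing statistics (caller-side IGNORE).

    Hypothetical helper: shown only to illustrate filtering at the
    entry points of your own code.
    """
    return [x for x in data
            if x is not None
            and not (isinstance(x, float) and math.isnan(x))]

nan = float("nan")
print(median(drop_missing([1, 2, nan, 3, None, 4])))  # 2.5
```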


Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread David Mertz
I have to say though that the existing behavior of
`statistics.median[_low|_high|]` is SURPRISING if not outright wrong.  It
is the behavior in existing Python, but it is very strange.

The implementation simply does whatever `sorted()` does, which is an
implementation detail.  In particular, NaNs, being neither less than nor
greater than any floating point number, just stay where they are during
sorting.  But that's a particular feature of TimSort.  Yes, we are
guaranteed that sorts are stable; and we have rules about which things can
and cannot be compared for inequality at all.  But beyond that, I do not
think Python ever promised that NaNs would remain in the same positions
after sorting, if a different stable algorithm would have placed them
elsewhere.

So in the incredibly unlikely event that I invent a DavidSort that behaves better
than TimSort, is stable, and compares only the same Python objects as
current CPython, a future version could use this algorithm without breaking
promises... even if NaNs sometimes sorted differently than in TimSort.
For that matter, some new implementation could use my not-nearly-as-good
DavidSort, and while being slower, would still be compliant.

Relying on that for the result of `median()` feels strange to me.  It feels
strange as the default behavior, but that's the status quo.  But it feels
even stranger that there are not at least options to deal with NaNs in more
of the signaling or poisoning ways that every other numeric library does.
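One sketch of what such options could look like as a caller-side wrapper, with policy names borrowed from scipy.stats' `nan_policy` parameter (this is illustrative, not a proposed or existing statistics-module API):

```python
import math
from statistics import median

def median_with_policy(data, nan_policy="propagate"):
    """Hypothetical wrapper around statistics.median.

    'propagate' poisons the result with NaN, 'raise' signals an
    error, and 'omit' silently drops NaNs (scipy.stats-style names).
    """
    data = list(data)
    if any(isinstance(x, float) and math.isnan(x) for x in data):
        if nan_policy == "raise":
            raise ValueError("NaN found in data")
        if nan_policy == "propagate":
            return float("nan")
        # nan_policy == "omit": drop NaNs and compute on the rest
        data = [x for x in data
                if not (isinstance(x, float) and math.isnan(x))]
    return median(data)

nan = float("nan")
print(median_with_policy([1, 2, nan, 3, 4], nan_policy="omit"))  # 2.5
```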

On Sun, Jan 6, 2019 at 7:28 PM Steven D'Aprano  wrote:

> Bug #33084 reports that the statistics library calculates median and
> other stats wrongly if the data contains NANs. Worse, the result depends
> on the initial placement of the NAN:
>
> py> from statistics import median
> py> NAN = float('nan')
> py> median([NAN, 1, 2, 3, 4])
> 2
> py> median([1, 2, 3, 4, NAN])
> 3
>
> See the bug report for more detail:
>
> https://bugs.python.org/issue33084
>
>
> The caller can always filter NANs out of their own data, but following
> the lead of some other stats packages, I propose a standard way for the
> statistics module to do so. I hope this will be uncontroversial (he
> says, optimistically...) but just in case, here is some prior art:
>
> (1) Nearly all R stats functions take a "na.rm" argument which defaults
> to False; if True, NA and NAN values will be stripped.
>
> (2) The scipy.stats.ttest_ind function takes a "nan_policy" argument
> which specifies what to do if a NAN is seen in the data.
>
>
> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
>
> (3) At least some Matlab functions, such as mean(), take an optional
> flag that determines whether to ignore NANs or include them.
>
> https://au.mathworks.com/help/matlab/ref/mean.html#bt5b82t-1-nanflag
>
>
> I propose adding a "nan_policy" keyword-only parameter to the relevant
> statistics functions (mean, median, variance etc), and defining the
> following policies:
>
> IGNORE:  quietly ignore all NANs
> FAIL:  raise an exception if any NAN is seen in the data
> PASS:  pass NANs through unchanged (the default)
> RETURN:  return a NAN if any NAN is seen in the data
> WARN:  ignore all NANs but raise a warning if one is seen
>
> PASS is equivalent to saying that you, the caller, have taken full
> responsibility for filtering out NANs and there's no need for the
> function to slow down processing by doing so again. Either that, or you
> want the current implementation-dependent behaviour.
>
> FAIL is equivalent to treating all NANs as "signalling NANs". The
> presence of a NAN is an error.
>
> RETURN is equivalent to "NAN poisoning" -- the presence of a NAN in a
> calculation causes it to return a NAN, allowing NANs to propagate
> through multiple calculations.
>
> IGNORE and WARN are the same, except IGNORE is silent and WARN raises a
> warning.
>
> Questions:
>
> - does anyone have any serious objections to this?
>
> - what do you think of the names for the policies?
>
> - are there any additional policies that you would like to see?
>   (if so, please give use-cases)
>
> - are you happy with the default?
>
>
> Bike-shed away!
>
>
>
> --
> Steve
>


-- 
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.


Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread Steven D'Aprano
On Sun, Jan 06, 2019 at 07:46:03PM -0500, David Mertz wrote:

> Would these policies be named as strings or with an enum? Following Pandas,
> we'd probably support both.

Sure, I can support both.


> I won't bikeshed the names, but they seem to
> cover desired behaviors.

Good to hear.



-- 
Steve


Re: [Python-ideas] NAN handling in the statistics module

2019-01-06 Thread David Mertz
Would these policies be named as strings or with an enum? Following Pandas,
we'd probably support both. I won't bikeshed the names, but they seem to
cover desired behaviors.

On Sun, Jan 6, 2019, 7:28 PM Steven D'Aprano  wrote:

> I propose adding a "nan_policy" keyword-only parameter to the relevant
> statistics functions (mean, median, variance etc), and defining the
> following policies:
>
> IGNORE:  quietly ignore all NANs
> FAIL:  raise an exception if any NAN is seen in the data
> PASS:  pass NANs through unchanged (the default)
> RETURN:  return a NAN if any NAN is seen in the data
> WARN:  ignore all NANs but raise a warning if one is seen
>
> [...]