Re: [Python-ideas] NAN handling in the statistics module
On Monday, January 7, 2019 at 3:16:07 AM UTC-5, Steven D'Aprano wrote:
> (By the way, I'm not outright disagreeing with you, I'm trying to weigh
> up the pros and cons of your position. You've given me a lot to think
> about. More below.)
>
> On Sun, Jan 06, 2019 at 11:31:30PM -0800, Nathaniel Smith wrote:
> > On Sun, Jan 6, 2019 at 11:06 PM Steven D'Aprano wrote:
> > > I'm not wedded to the idea that the default ought to be the current
> > > behaviour. If there is a strong argument for one of the others, I'm
> > > listening.
> >
> > "Errors should never pass silently"? Silently returning nonsensical
> > results is hard to defend as a default behavior IMO :-)
>
> If you violate the assumptions of the function, just about everything
> can in principle return nonsensical results. True, most of the time you
> have to work hard at it:
>
> class MyList(list):
>     def __len__(self):
>         return random.randint(0, sys.maxint)
>
> but it isn't unreasonable to document the assumptions of a function, and
> if the caller violates those assumptions, Garbage In Garbage Out
> applies.

I'm with Antoine, Nathaniel, David, and Chris: it is unreasonable to
silently return nonsensical results even if you've documented it.
Documenting it only makes it worse because it's like an "I told you so"
when people finally figure out what's wrong and go to file the bug.

> E.g. bisect requires that your list is sorted in ascending order. If it
> isn't, the results you get are nonsensical.
>
> py> data = [8, 6, 4, 2, 0]
> py> bisect.bisect(data, 1)
> 0
>
> That's not a bug in bisect, that's a bug in the caller's code, and it
> isn't bisect's responsibility to fix it.
>
> Although it could be documented better, that's the current situation
> with NANs and median(). Data with NANs don't have a total ordering, and
> total ordering is the unstated assumption behind the idea of a median or
> middle value. So all bets are off.
> > > How would you answer those who say that the right behaviour is not to
> > > propagate unwanted NANs, but to fail fast and raise an exception?
> >
> > Both seem defensible a priori, but every other mathematical operation
> > in Python propagates NaNs instead of raising an exception. Is there
> > something unusual about median that would justify giving it unusual
> > behavior?
>
> Well, not everything...
>
> py> NAN/0
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ZeroDivisionError: float division by zero
>
> There may be others. But I'm not sure that "everything else does it" is
> a strong justification. It is *a* justification, since consistency is
> good, but consistency does not necessarily outweigh other concerns.
>
> One possible argument for making PASS the default, even if that means
> implementation-dependent behaviour with NANs, is that in the absence of
> a clear preference for FAIL or RETURN, at least PASS is backwards
> compatible.
>
> You might shoot yourself in the foot, but at least you know it's the same
> foot you shot yourself in using the previous version *wink*
>
> --
> Steve

___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
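The bisect example above has a direct NaN analogue that is easy to demonstrate: every comparison against a NaN returns False, so a list containing one is, in general, not actually in sorted order after sorting — which is exactly why median() can report an arbitrary element. A minimal sketch:

```python
import statistics

# Every comparison with a NaN returns False, so sort() cannot place it
# anywhere meaningful -- the "sorted" list violates the sorted invariant.
data = [3.0, float("nan"), 1.0, 2.0]
s = sorted(data)
is_sorted = all(a <= b for a, b in zip(s, s[1:]))
print(is_sorted)  # False: any adjacent pair involving the NaN compares False

# median() indexes into this not-really-sorted list, so whatever happens
# to land in the middle is reported -- Garbage In, Garbage Out.
print(statistics.median(data))
```

Because a NaN compares False against everything, at least one adjacent pair in the "sorted" output always fails the `<=` check, whatever order the sort happens to produce.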
Re: [Python-ideas] NAN handling in the statistics module
On Wed, 9 Jan 2019 at 05:20, Steven D'Aprano wrote:
> On Mon, Jan 07, 2019 at 11:27:22AM +1100, Steven D'Aprano wrote:
> [...]
> > I propose adding a "nan_policy" keyword-only parameter to the relevant
> > statistics functions (mean, median, variance etc), and defining the
> > following policies:
>
> I asked some heavy users of statistics software (not just Python users)
> what behaviour they would find useful, and as I feared, I got no
> conclusive answer. So far, the answers seem to be almost evenly split
> into four camps:
>
> - don't do anything, it is the caller's responsibility to filter NANs;
> - raise an immediate error;
> - return a NAN;
> - treat them as missing data.

I would prefer to raise an exception on nan. It's much easier to debug an
exception than a nan.

Take a look at the Julia docs for their statistics module:
https://docs.julialang.org/en/v1/stdlib/Statistics/index.html

In Julia they have defined an explicit "missing" value. With that you can
explicitly distinguish between a calculation error and missing data. The
obvious Python equivalent would be None.

> On consideration of all the views expressed, thank you to everyone who
> commented, I'm now inclined to default to returning a NAN (which happens
> to be the current behaviour of mean etc, but not median except by
> accident) even if it impacts performance.

Whichever way you go with this, it might make sense to provide helper
functions for users to deal with nans, e.g.:

    xbar = mean(without_nans(data))
    xbar = mode(replace_nans_with_None(data))

--
Oscar
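The helper names above are just Oscar's placeholders, not existing functions. A minimal sketch of what they might look like, assuming NaNs only ever appear as floats (Decimal has its own NaNs, which this ignores):

```python
import math
from statistics import mean

def _is_nan(x):
    # Assumption: only float NaNs, not Decimal("nan").
    return isinstance(x, float) and math.isnan(x)

def without_nans(data):
    """Drop NaNs entirely (treat them as values to ignore)."""
    return [x for x in data if not _is_nan(x)]

def replace_nans_with_None(data):
    """Mark NaNs as missing data, Julia-style, using None."""
    return [None if _is_nan(x) else x for x in data]

xbar = mean(without_nans([1.0, float("nan"), 3.0]))
print(xbar)  # 2.0
```

With these, the caller chooses the policy explicitly at the call site rather than relying on whatever the statistics functions happen to do.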
Re: [Python-ideas] NAN handling in the statistics module
I've just read statistics.py, and found something that might be usefully
considered along with the NaN question.

>>> median([1])
1
>>> median([1, 1])
1.0

To record this, and associated behaviour involving Fraction, I've added:

Division by 2 in statistics.median: https://bugs.python.org/issue35698

--
Jonathan
Re: [Python-ideas] NAN handling in the statistics module
[David Mertz]
> I think consistent NaN-poisoning would be excellent behavior. It will
> always make sense for median (and its variants).

>> >>> statistics.mode([2, 2, nan, nan, nan])
>> nan
>> >>> statistics.mode([2, 2, inf - inf, inf - inf, inf - inf])
>> 2

> But in the mode case, I'm not sure we should ALWAYS treat a NaN as
> poisoning the result.

I am: I thought about the following but didn't write about it because
it's too strained to be of actual sane use ;-)

> If NaN means "missing value" then sometimes it could change things,
> and we shouldn't guess. But what if it cannot?
>
> >>> statistics.mode([9, 9, 9, 9, nan1, nan2, nan3])
>
> No matter what missing value we take those nans to maybe-possibly
> represent, 9 is still the most common element. This is only true when
> the most common thing occurs at least as often as the 2nd most common
> thing PLUS the number of all NaNs. But in that case, 9 really is the
> mode.

See "too strained" above. It's equally true that, e.g., the _median_ of
your list above:

    [9, 9, 9, 9, nan1, nan2, nan3]

is also 9 regardless of what values are plugged in for the nans. That may
be easier to realize at first with a simpler list, like

    [5, 5, nan]

It sounds essentially useless to me, just theoretically possible to make
a mess of implementations to cater to.

"The right" (obvious, unsurprising, useful, easy to implement, easy to
understand) non-exceptional behavior in the presence of NaNs is to
pretend they weren't in the list to begin with. But I'd rather people
ask for that _if_ that's what they want.
Re: [Python-ideas] NAN handling in the statistics module
On Tue, Jan 8, 2019 at 11:57 PM Tim Peters wrote:
> I'd like to see internal consistency across the central-tendency
> statistics in the presence of NaNs. What happens now:

I think consistent NaN-poisoning would be excellent behavior. It will
always make sense for median (and its variants).

> >>> statistics.mode([2, 2, nan, nan, nan])
> nan
> >>> statistics.mode([2, 2, inf - inf, inf - inf, inf - inf])
> 2

But in the mode case, I'm not sure we should ALWAYS treat a NaN as
poisoning the result. If NaN means "missing value" then sometimes it
could change things, and we shouldn't guess. But what if it cannot?

>>> statistics.mode([9, 9, 9, 9, nan1, nan2, nan3])

No matter what missing value we take those nans to maybe-possibly
represent, 9 is still the most common element. This is only true when the
most common thing occurs at least as often as the 2nd most common thing
PLUS the number of all NaNs. But in that case, 9 really is the mode.

We have one example of non-poisoning NaN in basic operations:

>>> nan**0
1.0

So if the NaN "cannot possibly change the answer" then it's reasonable to
produce a non-NaN answer IMO. Except we don't really get that with 0**nan
or 0*nan already... so a NaN-poisoning mode wouldn't actually offend my
sensibilities that much. :-)

I guess you could argue that NaN "could be inf". In that case 0*nan being
nan makes sense. But this still feels slightly odd:

>>> 0**inf
0.0
>>> 0**nan
nan

I guess it's supported by:

>>> 0**-1
ZeroDivisionError: 0.0 cannot be raised to a negative power

A *missing value* could be a negative one.

--
Keeping medicines from the bloodstreams of the sick; food from the
bellies of the hungry; books from the hands of the uneducated; technology
from the underdeveloped; and putting advocates of freedom in prisons.
Intellectual property is to the 21st century what the slave trade was to
the 16th.
Re: [Python-ideas] NAN handling in the statistics module
On Mon, Jan 07, 2019 at 11:27:22AM +1100, Steven D'Aprano wrote: [...] > I propose adding a "nan_policy" keyword-only parameter to the relevant > statistics functions (mean, median, variance etc), and defining the > following policies: I asked some heavy users of statistics software (not just Python users) what behaviour they would find useful, and as I feared, I got no conclusive answer. So far, the answers seem to be almost evenly split into four camps: - don't do anything, it is the caller's responsibility to filter NANs; - raise an immediate error; - return a NAN; - treat them as missing data. (Currently it is a small sample size, so I don't expect the answers will stay evenly split if more people answer.) On consideration of all the views expressed, thank you to everyone who commented, I'm now inclined to default to returning a NAN (which happens to be the current behaviour of mean etc, but not median except by accident) even if it impacts performance. -- Steve ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] NAN handling in the statistics module
I'd like to see internal consistency across the central-tendency
statistics in the presence of NaNs. What happens now:

mean: the code appears to guarantee that a NaN will be returned if a NaN
is in the input.

median: as recently detailed, just about anything can happen, depending
on how undefined behaviors in .sort() interact.

mode: while NaN != NaN at the Python level, internally dicts use an
identity shortcut so that, effectively, "is" takes precedence over
`__eq__`. So a given NaN object will be recognized as repeated if it
appears more than once, but distinct NaN objects remain distinct. So,
e.g.,

>>> from math import inf, nan
>>> import statistics
>>> statistics.mode([2, 2, nan, nan, nan])
nan

That's NOT "NaN-in, NaN-out", it's "a single NaN object is the object
that appeared most often". Make those 3 distinct NaN objects (inf - inf
results) instead, and the mode changes:

>>> statistics.mode([2, 2, inf - inf, inf - inf, inf - inf])
2

Since the current behavior of `mean()` is the only one that's sane, that
should probably become the default for all of them (NaN in -> NaN out).
"NaN in -> exception" and "pretend NaNs in the input don't exist" are the
other possibly useful behaviors.

About median speed, I wouldn't worry. Long ago I tried many variations of
QuickSelect, and it required very large inputs for a Python-coded
QuickSelect to run faster than a straightforward .sort()+index. It's
bound to be worse now:

- Current Python .sort() is significantly faster on one-type lists
  because it figures out the single type-specific comparison routine
  needed once at the start, instead of enduring N log N full-blown
  PyObject_RichCompareBool calls.

- And the current .sort() can be very much faster than older ones on data
  with significant order. In the limit, .sort()+index will run faster
  than any QuickSelect variant on already-sorted or
  already-reverse-sorted data.
QuickSelect variants aren't adaptive in any sense, except that a "fat
pivot" version (3-way partition, into < pivot, == pivot, and > pivot
regions) is very effective on data with many equal values.

In Python 3.7.2, for randomly ordered random-ish floats I find that
median() is significantly faster than mean() even on lists with millions
of elements, despite that the former sorts and the latter doesn't.
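Tim's mode() examples are easy to reproduce, and the behaviour hinges on object identity rather than numeric equality. A small sketch of the dict identity shortcut at work (CPython behaviour; the two outputs match the transcript above):

```python
from math import inf, nan
import math
import statistics

# One NaN *object* repeated three times: dict lookup short-circuits on
# identity ("is" before "=="), so all three occurrences count as the
# same key, and that key wins with a count of 3.
result = statistics.mode([2, 2, nan, nan, nan])
print(math.isnan(result))  # True

# Three *distinct* NaN objects (each inf - inf builds a fresh float):
# they land in separate entries with a count of 1 apiece, so 2 wins.
print(statistics.mode([2, 2, inf - inf, inf - inf, inf - inf]))  # 2
```

The same data, differing only in whether the NaNs are one object or three, yields two different modes — which is the inconsistency Tim is pointing at.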
Re: [Python-ideas] NAN handling in the statistics module
On Tue, Jan 08, 2019 at 04:25:17PM +0900, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
>
> > By definition, data containing Not A Number values isn't numeric :-)
>
> Unfortunately, that's just a joke, because in fact numeric functions
> produce NaNs.

I'm not sure if you're agreeing with me or disagreeing, so I'll assume
you're agreeing and move on :-)

> I agree that this can easily be resolved by documenting that it is the
> caller's responsibility to remove NaNs from numeric data, but I prefer
> your proposed flags.
>
> > The only reason why I don't call it a bug is that median() makes no
> > promises about NANs at all, any more than it makes promises about the
> > median of a list of sets or any other values which don't define a
> > total order.
>
> Pedantically, I would prefer that the promise that ordinal data
> (vs. specifically numerical) has a median be made explicit, as there
> are many cases where statistical data is ordinal.

I think that is reasonable. Provided the data defines a total order, the
median is well-defined when there are an odd number of data points, or
you can use median_low and median_high regardless of the number of data
points.

> This may be a moot point, as in most cases ordinal data is represented
> numerically in computation (Likert scales, for example, are rarely coded
> as "hate", "dislike", "indifferent", "like", "love", but instead as
> 1, 2, 3, 4, 5), and from the point of view of UI presentation, IntEnums
> do the right thing here (print as identifiers, sort as integers).
>
> Perhaps a better way to document this would be to suggest that ordinal
> data be represented using IntEnums? (Again to be pedantic, one might
> want OrderedEnums that can be compared but don't allow other
> arithmetic operations.)

That's a nice solution.

--
Steve (the other one)
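The IntEnum suggestion works with today's statistics module. A quick sketch using Stephen's Likert coding (the class and data are illustrative; note that with an even number of points, median() averages two members and returns a plain number, which is why median_low/median_high are the ordinal-safe choices):

```python
from enum import IntEnum
import statistics

class Likert(IntEnum):
    HATE = 1
    DISLIKE = 2
    INDIFFERENT = 3
    LIKE = 4
    LOVE = 5

responses = [Likert.LIKE, Likert.HATE, Likert.LOVE, Likert.LIKE,
             Likert.INDIFFERENT]

# Odd number of data points: the median is an actual response member.
m = statistics.median(responses)
print(m == Likert.LIKE)  # True

# Even count: median_low still returns a member of the data, whereas
# plain median() would return the average of the two middle values.
low = statistics.median_low(responses + [Likert.HATE])
print(low == Likert.INDIFFERENT)  # True
```

(One caveat not visible here: how IntEnum members *print* changed in Python 3.11, so "print as identifiers" now requires `repr()` rather than `str()`.)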
Re: [Python-ideas] NAN handling in the statistics module
On Mon, Jan 07, 2019 at 07:35:45PM, MRAB wrote:
> Could the functions optionally accept a callback that will be called
> when a NaN is first seen?
>
> If the callback returns False, NaNs are suppressed, otherwise they are
> retained and the function returns NaN (or whatever).

That's an interesting API which I shall have to think about.

> The callback would give the user a chance to raise a warning or an
> exception, if desired.

One practical annoyance of this API is that a lambda cannot contain a
raise statement, so people desiring "fail fast" semantics can't do this:

    result = mean(data, callback=lambda: raise Exception)

They have to pre-declare the callback using def.

--
Steve
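The callback parameter is only a proposal, so the signature below is hypothetical — a wrapper around the real statistics.mean, sketched to show MRAB's semantics (False suppresses NaNs, truthy returns NaN) and the lambda limitation:

```python
import math
from statistics import mean as _mean

def mean(data, callback=None):
    """Hypothetical sketch of the proposed callback API -- NOT the real
    statistics.mean signature."""
    data = list(data)
    if any(isinstance(x, float) and math.isnan(x) for x in data):
        if callback is not None and callback():
            return float("nan")  # retained: NaN result
        # suppressed: drop the NaNs (also the default in this sketch)
        data = [x for x in data
                if not (isinstance(x, float) and math.isnan(x))]
    return _mean(data)

# A lambda body must be an expression, so `lambda: raise Exception` is a
# SyntaxError. "Fail fast" callers must pre-declare a function instead:
def fail_fast():
    raise ValueError("NaN seen in data")

print(mean([1.0, float("nan"), 3.0], callback=lambda: False))  # 2.0
```

With `callback=fail_fast`, the exception propagates out of the first NaN check, which is the fail-fast behaviour Steven mentions.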
Re: [Python-ideas] NAN handling in the statistics module
This callback idea feels way over-engineered for this module. It would absolutely make sense in a more specialized numeric or statistical library. But `statistics` feels to me like it should be only simple and basic operations, with very few knobs attached. On Mon, Jan 7, 2019, 2:36 PM MRAB On 2019-01-07 16:34, Steven D'Aprano wrote: > > On Mon, Jan 07, 2019 at 10:05:19AM -0500, David Mertz wrote: > [snip] > >> It's not hard to manually check for NaNs and > >> generate those in your own code. > > > > That is correct, but by that logic, we don't need to support *any* form > > of NAN handling at all. It is easy (if inefficent) for the caller to > > pre-filter their data. I want to make it easier and more convenient and > > avoid having to iterate over the data twice if it isn't necessary. > > > Could the functions optionally accept a callback that will be called > when a NaN is first seen? > > If the callback returns False, NaNs are suppressed, otherwise they are > retained and the function returns NaN (or whatever). > > The callback would give the user a chance to raise a warning or an > exception, if desired. > ___ > Python-ideas mailing list > Python-ideas@python.org > https://mail.python.org/mailman/listinfo/python-ideas > Code of Conduct: http://python.org/psf/codeofconduct/ > ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] NAN handling in the statistics module
On 2019-01-07 16:34, Steven D'Aprano wrote: On Mon, Jan 07, 2019 at 10:05:19AM -0500, David Mertz wrote: [snip] It's not hard to manually check for NaNs and generate those in your own code. That is correct, but by that logic, we don't need to support *any* form of NAN handling at all. It is easy (if inefficent) for the caller to pre-filter their data. I want to make it easier and more convenient and avoid having to iterate over the data twice if it isn't necessary. Could the functions optionally accept a callback that will be called when a NaN is first seen? If the callback returns False, NaNs are suppressed, otherwise they are retained and the function returns NaN (or whatever). The callback would give the user a chance to raise a warning or an exception, if desired. ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] NAN handling in the statistics module
On Mon, Jan 7, 2019 at 12:19 PM David Mertz wrote: > Under a partial ordering, a median may not be unique. Even under a total > ordering this is true if some subset of elements form an equivalence > class. But under partial ordering, the non-uniqueness can get much weirder. > I'm sure with more thought, weirder things can be thought of. But just as a quick example, it would be easy to write classes such that: a < b < c < a In such a case (or expand for an odd number of distinct things), it would be reasonable to call ANY element of [a, b, c] a median. That's funny, but it is not imprecise. -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
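Classes with the cyclic a < b < c < a ordering David describes are easy to write. A toy sketch (names invented here) showing that sort-based median() happily reports one of them — it has no way to detect the cycle:

```python
import statistics

class RPS:
    """Rock-paper-scissors ordering: deliberately non-transitive."""
    _loses_to = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

    def __init__(self, name):
        self.name = name

    def __lt__(self, other):
        # x < y iff y beats x, giving rock < paper < scissors < rock.
        return self._loses_to[self.name] == other.name

rock, paper, scissors = RPS("rock"), RPS("paper"), RPS("scissors")
assert rock < paper and paper < scissors and scissors < rock

# sort() "succeeds" (comparisons never fail, they just cycle), and
# median() reports whichever element happens to land in the middle.
m = statistics.median([rock, paper, scissors])
print(m.name)
```

Any of the three elements is as defensible a median as any other, which is exactly the point about non-uniqueness under partial (here, non-transitive) ordering.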
Re: [Python-ideas] NAN handling in the statistics module
On Mon, Jan 7, 2019, 11:38 AM Steven D'Aprano wrote:
> It's not a bug in median(), because median requires the data implement a
> total order. Although that isn't explicitly documented, it is common
> sense: if the data cannot be sorted into smallest-to-largest order, how
> can you decide which value is in the middle?

I can see no reason that median per se requires a total order. Yes, the
implementation chosen (and many reasonable and obvious implementations)
make that assumption. But here is a perfectly reasonable definition of
median:

* A median is an element of a collection such that 1/2 of all elements
  of the collection are less than it.

Depending on how you interpret median, this element might also not be in
the original collection, but be some newly generated value that has that
property. E.g. statistics.median([1, 2, 3, 4]) == 2.5.

Under a partial ordering, a median may not be unique. Even under a total
ordering this is true if some subset of elements form an equivalence
class. But under partial ordering, the non-uniqueness can get much
weirder.

> What is explicitly documented is that median requires numeric data, and
> NANs aren't numbers. So the only bug here is the caller's failure to
> filter out NANs. If you pass it garbage data, you get garbage results.

OK, then we should either raise an exception or propagate the NaN if that
is the intended meaning of the function. And obviously document that such
is the assumption. NaNs *are* explicitly in the floating-point domain, so
it's fuzzy whether they are numeric or not, notwithstanding the name.

I'm very happy to push NaN-filtering to users (as NumPy does, although it
provides alternate functions for many reductions that incorporate this...
the basic ones always propagate NaNs though).

> Nevertheless, it is a perfectly reasonable thing to want to use data
> which may or may not contain NANs, and I want to enhance the statistics
> module to make it easier for the caller to handle NANs in whichever way
> they see fit.
> This is a new feature, not a bug fix.

I disagree about bug vs. feature. The old behavior is simply and
unambiguously wrong, but was not previously noticed. Obviously, the bug
does not affect most uses, which is why it was not noticed.

> If you truly believe that, then you should also believe that both
> list.sort() and the bisect module are buggy, for precisely the same
> reason.

I cannot perceive any close connection between the correct behavior of
statistics.median() and that of list.sort() or bisect. I know the
concrete implementation of the former uses the latter, but the answers
for what is RIGHT feel completely independent to me.

> I doubt Quickselect will be immune to the problem of NANs. It too relies
> on comparisons, and while I don't know for sure that it requires a total
> order, I'd be surprised if it doesn't. Quickselect is basically a
> variant of Quicksort that only partially sorts the data.

Yes, I was thinking of trying to tweak Quickselect to handle NaNs during
the process. I.e. probably terminate and propagate the NaN early, as soon
as one is encountered. That might save much of the work if a NaN is
encountered early and most comparisons and moves can be avoided. Of
course, I'm sure there is a worst case where almost all the work is done
before a NaN check is performed in some constructed example.
Re: [Python-ideas] NAN handling in the statistics module
On Mon, Jan 7, 2019 at 8:39 AM Steven D'Aprano wrote: > Its not a bug in median(), because median requires the data implement a > total order. Although that isn't explicitly documented, it is common > sense: if the data cannot be sorted into smallest-to-largest order, how > can you decide which value is in the middle? > > What is explicitly documented is that median requires numeric data, and > NANs aren't numbers. So the only bug here is the caller's failure to > filter out NANs. If you pass it garbage data, you get garbage results. > > Nevertheless, it is a perfectly reasonable thing to want to use data > which may or may not contain NANs, and I want to enhance the statistics > module to make it easier for the caller to handle NANs in whichever way > they see fit. This is a new feature, not a bug fix. > So then you are arguing that making reasonable treatment of NANs the default is not breaking backwards compatibility (because previously the data was considered wrong). This sounds like a good idea to me. Presumably the NANs are inserted into the data explicitly in order to signal missing data -- this seems more plausible to me (given the typical use case for the statistics module) than that they would be the result of a computation like Inf/Inf. (While propagating NANs makes sense for the fundamental arithmetical and mathematical functions, given that we have chosen not to raise an error when encountering them, I think other stdlib libraries are not beholden to that behavior.) -- --Guido van Rossum (python.org/~guido) ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] NAN handling in the statistics module
On Mon, Jan 07, 2019 at 10:05:19AM -0500, David Mertz wrote:
> On Mon, Jan 7, 2019 at 6:50 AM Steven D'Aprano wrote:
>
> > > I'll provide a suggested patch on the bug. It will simply be a wholly
> > > different implementation of median and friends.
> >
> > I ask for a documentation patch and you start talking about a whole new
> > implementation. Huh.
> >
> > A new implementation with precisely the same behaviour is a waste of
> > time, so I presume you're planning to change the behaviour. How about
> > if you start off by explaining what the new semantics are?
>
> I think it would be counter-productive to document the bug (as something
> other than a bug).

It's not a bug in median(), because median requires the data implement a
total order. Although that isn't explicitly documented, it is common
sense: if the data cannot be sorted into smallest-to-largest order, how
can you decide which value is in the middle?

What is explicitly documented is that median requires numeric data, and
NANs aren't numbers. So the only bug here is the caller's failure to
filter out NANs. If you pass it garbage data, you get garbage results.

Nevertheless, it is a perfectly reasonable thing to want to use data
which may or may not contain NANs, and I want to enhance the statistics
module to make it easier for the caller to handle NANs in whichever way
they see fit. This is a new feature, not a bug fix.

> Picking what is a completely arbitrary element in face of a non-total
> order can never be "correct" behavior, and is never worth preserving
> for compatibility.

If you truly believe that, then you should also believe that both
list.sort() and the bisect module are buggy, for precisely the same
reason. Perhaps you ought to raise a couple of bug reports, and see if
you can get Tim and Raymond to agree that sorting and bisect should do
something other than what they already do in the face of data that
doesn't define a total order.
> I think the use of statistics.median against partially ordered elements
> is simply rare enough that no one tripped against it, or at least no one
> reported it before.

I'm sure it is rare. Nevertheless, I still want to make it easier for
people to deal with this case.

> Notice that the code itself pretty much recognizes the bug in this
> comment:
>
> # FIXME: investigate ways to calculate medians without sorting? Quickselect?

I doubt Quickselect will be immune to the problem of NANs. It too relies
on comparisons, and while I don't know for sure that it requires a total
order, I'd be surprised if it doesn't. Quickselect is basically a variant
of Quicksort that only partially sorts the data.

> So it seems like the original author knew the implementation was wrong.

That's not why I put that comment in. Sorting is O(N log N) on average,
and Quickselect can be O(N) on average. In principle, Quickselect or a
similar selection algorithm could be faster than sorting.

[...]

> It's not hard to manually check for NaNs and generate those in your own
> code.

That is correct, but by that logic, we don't need to support *any* form
of NAN handling at all. It is easy (if inefficient) for the caller to
pre-filter their data. I want to make it easier and more convenient and
avoid having to iterate over the data twice if it isn't necessary.

--
Steve
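The caller-side pre-filter Steven refers to can at least be written as a single lazy pass, so "iterating twice" only arises if you also need to know whether NaNs were present. A sketch (float NaNs only, as an assumption):

```python
import math
from statistics import mean, median

def drop_nans(iterable):
    # Generator: one pass over the data, no intermediate list.
    # Assumption: NaNs only appear as floats.
    return (x for x in iterable
            if not (isinstance(x, float) and math.isnan(x)))

data = [2.0, float("nan"), 6.0, 4.0]
print(mean(drop_nans(data)))    # 4.0
print(median(drop_nans(data)))  # 4.0
```

Note each statistics call gets a fresh generator: median() materialises and sorts its input, so a generator can only be consumed once.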
Re: [Python-ideas] NAN handling in the statistics module
On Mon, Jan 7, 2019 at 6:50 AM Steven D'Aprano wrote:
> > I'll provide a suggested patch on the bug. It will simply be a wholly
> > different implementation of median and friends.
>
> I ask for a documentation patch and you start talking about a whole new
> implementation. Huh.
>
> A new implementation with precisely the same behaviour is a waste of
> time, so I presume you're planning to change the behaviour. How about if
> you start off by explaining what the new semantics are?

I think it would be counter-productive to document the bug (as something
other than a bug). Picking what is a completely arbitrary element in face
of a non-total order can never be "correct" behavior, and is never worth
preserving for compatibility. I think the use of statistics.median
against partially ordered elements is simply rare enough that no one
tripped against it, or at least no one reported it before.

Notice that the code itself pretty much recognizes the bug in this
comment:

    # FIXME: investigate ways to calculate medians without sorting? Quickselect?

So it seems like the original author knew the implementation was wrong.
But you're right, the new behavior needs to be decided. Propagating NaNs
is reasonable. Filtering out NaNs is reasonable. Those are the default
behaviors of NumPy and pandas, respectively:

    np.median([1, 2, 3, nan])           # -> nan
    pd.Series([1, 2, 3, nan]).median()  # -> 2.0

(Yes, of course there are ways in each to get the other behavior.) Other
non-Python tools similarly suggest one of those behaviors, but really
nothing else.

So yeah, what I was suggesting as a patch was an implementation that had
PROPAGATE and IGNORE semantics. I don't have a real opinion about which
should be the default, but the current behavior should simply not exist
at all.

As I think about it, warnings and exceptions are really too complex an
API for this module. It's not hard to manually check for NaNs and
generate those in your own code.
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th. ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] NAN handling in the statistics module
On Mon, Jan 07, 2019 at 02:01:34PM +, Jonathan Fine wrote: > Finally, I suggest that we might learn from > == > Fix some special cases in Fractions? > https://mail.python.org/pipermail/python-ideas/2018-August/053083.html > == I remember that thread from August, and I've just re-read the entire thing now, and I don't see the relevance. Can you explain why you think it is relevant to this thread? -- Steve ___ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] NAN handling in the statistics module
Happy New Year (off topic).

Based on a quick review of the python docs, the bug report, PEP 450 and
this thread, I suggest

1. More carefully draw attention to the NaN feature, in the
   documentation for existing Python versions.
2. Consider revising statistics.py so that it raises an exception, when
   passed NaN data.

https://www.python.org/dev/peps/pep-0450/#rationale says

    The proposed statistics module is motivated by the "batteries
    included" philosophy towards the Python standard library. Raymond
    Hettinger and other senior developers have requested a quality
    statistics library that falls somewhere in between high-end
    statistics libraries and ad hoc code. Statistical functions such as
    mean, standard deviation and others are obvious and useful batteries,
    familiar to any Secondary School student.

The PEP makes no mention of NaN. Was it in error, in not stating that NaN
data is admissible? Is NaN part of the "batteries familiar to any
Secondary School student"?

https://docs.python.org/3/library/statistics.html says

    This module provides functions for calculating mathematical
    statistics of numeric (Real-valued) data.

Some people regard NaN as not being a real-valued number. (Hint: there's
a clue in the name: Not A Number.) Note that statistics.py already raises
StatisticsError, when it regards the data as flawed.

Finally, I suggest that we might learn from
==
Fix some special cases in Fractions?
https://mail.python.org/pipermail/python-ideas/2018-August/053083.html
==

I'll put a brief summary of my message into the bug tracker for this
issue.

--
Jonathan
Re: [Python-ideas] NAN handling in the statistics module
On Mon, Jan 07, 2019 at 01:34:47AM -0500, David Mertz wrote:

> > I'm not opposed to documenting this better. Patches welcome :-)
>
> I'll provide a suggested patch on the bug. It will simply be a wholly
> different implementation of median and friends.

I ask for a documentation patch and you start talking about a whole new implementation. Huh.

A new implementation with precisely the same behaviour is a waste of time, so I presume you're planning to change the behaviour. How about if you start off by explaining what the new semantics are?

-- Steve
Re: [Python-ideas] NAN handling in the statistics module
On Sun, 6 Jan 2019 19:40:32 -0800 Stephan Hoyer wrote:

> On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano wrote:
> > I propose adding a "nan_policy" keyword-only parameter to the relevant
> > statistics functions (mean, median, variance etc), and defining the
> > following policies:
> >
> > IGNORE: quietly ignore all NANs
> > FAIL: raise an exception if any NAN is seen in the data
> > PASS: pass NANs through unchanged (the default)
> > RETURN: return a NAN if any NAN is seen in the data
> > WARN: ignore all NANs but raise a warning if one is seen
>
> I don't think PASS should be the default behavior, and I'm not sure it
> would be productive to actually implement all of these options.
>
> For reference, NumPy and pandas (the two most popular packages for data
> analytics in Python) support two of these modes:
> - RETURN (numpy.mean() and skipna=False for pandas)
> - IGNORE (numpy.nanmean() and skipna=True for pandas)
>
> RETURN is the default behavior for NumPy; IGNORE is the default for pandas.

I agree with Stephan that RETURN and IGNORE are the only useful modes of operation here.

Regards

Antoine.
Re: [Python-ideas] NAN handling in the statistics module
(By the way, I'm not outright disagreeing with you, I'm trying to weigh up the pros and cons of your position. You've given me a lot to think about. More below.)

On Sun, Jan 06, 2019 at 11:31:30PM -0800, Nathaniel Smith wrote:
> On Sun, Jan 6, 2019 at 11:06 PM Steven D'Aprano wrote:
> > I'm not wedded to the idea that the default ought to be the current
> > behaviour. If there is a strong argument for one of the others, I'm
> > listening.
>
> "Errors should never pass silently"? Silently returning nonsensical
> results is hard to defend as a default behavior IMO :-)

If you violate the assumptions of the function, just about everything can in principle return nonsensical results. True, most of the time you have to work hard at it:

    class MyList(list):
        def __len__(self):
            return random.randint(0, sys.maxsize)

but it isn't unreasonable to document the assumptions of a function, and if the caller violates those assumptions, Garbage In Garbage Out applies.

E.g. bisect requires that your list is sorted in ascending order. If it isn't, the results you get are nonsensical.

    py> data = [8, 6, 4, 2, 0]
    py> bisect.bisect(data, 1)
    0

That's not a bug in bisect, that's a bug in the caller's code, and it isn't bisect's responsibility to fix it.

Although it could be documented better, that's the current situation with NANs and median(). Data with NANs don't have a total ordering, and total ordering is the unstated assumption behind the idea of a median or middle value. So all bets are off.

> > How would you answer those who say that the right behaviour is not to
> > propagate unwanted NANs, but to fail fast and raise an exception?
>
> Both seem defensible a priori, but every other mathematical operation
> in Python propagates NaNs instead of raising an exception. Is there
> something unusual about median that would justify giving it unusual
> behavior?

Well, not everything...
    py> NAN/0
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ZeroDivisionError: float division by zero

There may be others. But I'm not sure that "everything else does it" is a strong justification. It is *a* justification, since consistency is good, but consistency does not necessarily outweigh other concerns.

One possible argument for making PASS the default, even if that means implementation-dependent behaviour with NANs, is that in the absence of a clear preference for FAIL or RETURN, at least PASS is backwards compatible.

You might shoot yourself in the foot, but at least you know it's the same foot you shot yourself in using the previous version *wink*

-- Steve
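The NAN/0 counterexample above can be checked outside a REPL; a quick sketch:

```python
# Quick check that dividing a NaN by zero raises, rather than
# propagating the NaN as most other float operations would.
nan = float('nan')
try:
    nan / 0
except ZeroDivisionError as exc:
    print("ZeroDivisionError:", exc)
```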
Re: [Python-ideas] NAN handling in the statistics module
On Sun, Jan 6, 2019 at 11:06 PM Steven D'Aprano wrote:
> I'm not wedded to the idea that the default ought to be the current
> behaviour. If there is a strong argument for one of the others, I'm
> listening.

"Errors should never pass silently"? Silently returning nonsensical results is hard to defend as a default behavior IMO :-)

> How would you answer those who say that the right behaviour is not to
> propagate unwanted NANs, but to fail fast and raise an exception?

Both seem defensible a priori, but every other mathematical operation in Python propagates NaNs instead of raising an exception. Is there something unusual about median that would justify giving it unusual behavior?

-n

-- Nathaniel J. Smith -- https://vorpus.org
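Nathaniel's claim that ordinary mathematical operations propagate NaNs can be illustrated with the stdlib alone:

```python
# Arithmetic and math-module functions return NaN rather than raising
# when given NaN input -- the propagation behaviour discussed above.
import math

nan = float('nan')
results = [nan + 1, nan * 0.5, math.sin(nan), math.fsum([1.0, nan])]
assert all(math.isnan(r) for r in results)
```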
Re: [Python-ideas] NAN handling in the statistics module
On Sun, Jan 06, 2019 at 07:40:32PM -0800, Stephan Hoyer wrote:
> On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano wrote:
> > I propose adding a "nan_policy" keyword-only parameter to the relevant
> > statistics functions (mean, median, variance etc), and defining the
> > following policies:
> >
> > IGNORE: quietly ignore all NANs
> > FAIL: raise an exception if any NAN is seen in the data
> > PASS: pass NANs through unchanged (the default)
> > RETURN: return a NAN if any NAN is seen in the data
> > WARN: ignore all NANs but raise a warning if one is seen
>
> I don't think PASS should be the default behavior, and I'm not sure it
> would be productive to actually implement all of these options.

I'm not wedded to the idea that the default ought to be the current behaviour. If there is a strong argument for one of the others, I'm listening.

> For reference, NumPy and pandas (the two most popular packages for data
> analytics in Python) support two of these modes:
> - RETURN (numpy.mean() and skipna=False for pandas)
> - IGNORE (numpy.nanmean() and skipna=True for pandas)
>
> RETURN is the default behavior for NumPy; IGNORE is the default for pandas.
>
> I'm pretty sure RETURN is the right default behavior for Python's standard
> library and anything else should be considered a bug. It safely propagates
> NaNs, along the lines of IEEE float behavior.

How would you answer those who say that the right behaviour is not to propagate unwanted NANs, but to fail fast and raise an exception?

> I'm not sure what the use cases are for PASS, FAIL, or WARN, none of which
> are supported by NumPy or pandas:
> - PASS is a license to return silently incorrect results, in return for
> very marginal performance benefits.

By my (very rough) preliminary testing, the cost of checking for NANs doubles the cost of calculating the median, and increases the cost of calculating the mean() by 25%.
I'm not trying to compete with statistics libraries written in C for speed, but that doesn't mean I don't care about performance at all. The statistics library is already slower than I like and I don't want to slow it down further for the common case (numeric data with no NANs) for the sake of the uncommon case (data with NANs).

But I hear you about the "return silently incorrect results" part. Fortunately, I think that only applies to sort-based functions like median(). mean() etc ought to propagate NANs with any reasonable implementation, but I'm reluctant to make that a guarantee in case I come up with some unreasonable implementation :-)

> This seems at odds with the intended
> focus of the statistics module on correctness over speed. Returning
> incorrect statistics should not be considered a feature that needs to be
> maintained.

It is only incorrect because the data violates the documented requirement that it be *numeric data*, and the undocumented requirement that the numbers have a total order. (So complex numbers are out.) I admit that the docs could be improved, but there are no guarantees made about NANs. This doesn't mean I don't want to improve the situation! Far from it, hence this discussion.

> - FAIL would make sense if statistics functions could introduce *new* NaN
> values. But as far as I can tell, statistics functions already raise
> StatisticsError in these cases (e.g., if zero data points are provided). If
> users are concerned about accidentally propagating NaNs, they should be
> encouraged to check for NaNs at the entry points of their code.

As far as I can tell, there are two kinds of people when it comes to NANs: those who think that signalling NANs are a waste of time and NANs should always propagate, and those who hate NANs and wish that they would always signal (raise an exception). I'm not going to get into an argument about who is right or who is wrong.

> - WARN is even less useful than FAIL. Seriously, who likes warnings?
Me :-)

> NumPy
> uses this approach in array operations that produce NaNs (e.g., when
> dividing by zero), because *some* but not all results may be valid. But
> statistics functions return scalars.
>
> I'm not even entirely sure it makes sense to add the IGNORE option, or at
> least to add it only for NaN. None is also a reasonable sentinel for a
> missing value in Python, and user defined types (e.g., pandas.NaT) also
> fall in this category. It seems a little strange to single NaN out in
> particular.

I am considering adding support for a dedicated "missing" value, whether it is None or a special sentinel. But one thing at a time. Ignoring NANs is moderately common in other statistics libraries, and although I personally feel that NANs shouldn't be used for missing values, I know many people do so.

-- Steve
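For concreteness, the policies being debated could be sketched as a wrapper around the existing median(); this is a hypothetical illustration of the proposal's semantics, not an existing API, and it only treats float NaNs:

```python
# Hypothetical sketch of a nan_policy-aware median. The policy names
# (PASS/FAIL/RETURN/IGNORE) follow the proposal in this thread; the
# function itself is illustrative, not part of the statistics module.
import math
import statistics

def _is_nan(x):
    return isinstance(x, float) and math.isnan(x)

def median(data, nan_policy="PASS"):
    data = list(data)
    has_nan = any(_is_nan(x) for x in data)
    if nan_policy == "FAIL" and has_nan:
        raise statistics.StatisticsError("NAN in data")
    if nan_policy == "RETURN" and has_nan:
        return math.nan
    if nan_policy == "IGNORE":
        data = [x for x in data if not _is_nan(x)]
    # PASS: current implementation-dependent behaviour.
    return statistics.median(data)
```

Note the cost model Steven describes: FAIL, RETURN and IGNORE all scan the whole data set once before computing anything, which is exactly the overhead PASS avoids.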
Re: [Python-ideas] NAN handling in the statistics module
On Mon, Jan 7, 2019 at 1:27 AM Steven D'Aprano wrote:
> > In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
> > Out[4]: 1
> > In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
> > Out[5]: nan
>
> The second is possibly correct if one thinks that the median of a list
> containing NAN should return NAN -- but it's only correct by accident,
> not design.

Exactly... in the second example, the nan just happens to wind up "in the middle" of the sorted() list. The fact that it is the return value has nothing to do with propagating the nan (if it did, I think it would be a reasonable answer). I contrived the examples to get these... the first answer, which is the "most wrong number", is also selected for the same reason that a nan is "near the middle."

> I'm not opposed to documenting this better. Patches welcome :-)

I'll provide a suggested patch on the bug. It will simply be a wholly different implementation of median and friends.

> There are at least three correct behaviours in the face of data
> containing NANs: propagate a NAN result, fail fast with an exception, or
> treat NANs as missing data that can be ignored. Only the caller can
> decide which is the right policy for their data set.

I'm not sure that raising right away is necessary as an option. That feels like something a user could catch at the end when they get a NaN result. But those seem reasonable as three options.

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
Re: [Python-ideas] NAN handling in the statistics module
On Sun, Jan 06, 2019 at 10:52:47PM -0500, David Mertz wrote:
> Playing with Tim's examples, this suggests that statistics.median() is
> simply outright WRONG. I can think of absolutely no way to characterize
> these as reasonable results:
>
> Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)
> In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
> Out[4]: 1
> In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
> Out[5]: nan

The second is possibly correct if one thinks that the median of a list containing NAN should return NAN -- but it's only correct by accident, not design.

As I wrote on the bug tracker:

"I agree that the current implementation-dependent behaviour when there are NANs in the data is troublesome."

The only reason why I don't call it a bug is that median() makes no promises about NANs at all, any more than it makes promises about the median of a list of sets or any other values which don't define a total order. help(median) says:

    Return the median (middle value) of numeric data.

By definition, data containing Not A Number values isn't numeric :-)

I'm not opposed to documenting this better. Patches welcome :-)

There are at least three correct behaviours in the face of data containing NANs: propagate a NAN result, fail fast with an exception, or treat NANs as missing data that can be ignored. Only the caller can decide which is the right policy for their data set.

Aside: the IEEE-754 standard provides both signalling and quiet NANs.
It is hard and unreliable to generate signalling float NANs in Python, but we can do it with Decimal:

    py> from statistics import median
    py> from decimal import Decimal
    py> median([1, 3, 4, Decimal("sNAN"), 2])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.5/statistics.py", line 349, in median
        data = sorted(data)
    decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>]

In principle, one ought to be able to construct float signalling NANs too, but unfortunately that's platform dependent:

https://mail.python.org/pipermail/python-dev/2018-November/155713.html

Back to the topic on hand: I agree that median() does "the wrong thing" when NANs are involved, but there is no one "right thing" that we can do in its place. People disagree as to whether NANs should propagate, or raise, or be treated as missing data, and I see good arguments for all three.

-- Steve
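The Decimal example above can be reproduced without a REPL; a minimal sketch (the exact traceback, such as the line number inside statistics.py, varies by Python version):

```python
# Comparing a signalling Decimal NaN raises InvalidOperation, so the
# sort inside median() fails fast instead of quietly returning a result.
from decimal import Decimal, InvalidOperation
from statistics import median

try:
    median([1, 3, 4, Decimal("sNaN"), 2])
except InvalidOperation:
    print("InvalidOperation raised, as in the traceback above")
```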
Re: [Python-ideas] NAN handling in the statistics module
[David Mertz]
> OK, let me be more precise. Obviously if the implementation in a class is:
>
> class Foo:
>     def __lt__(self, other):
>         return random.random() < 0.5
>
> Then we aren't going to rely on much.
>
> * If comparison of any two items in a list (under __lt__) is deterministic, is
> the resulting sort order deterministic? (Pretty sure this is a yes)

Yes, but the result is not defined unless __lt__ also defines a total ordering.

> * If the pairwise comparisons are deterministic, is sorting idempotent?

Not necessarily. For example, the 2-element list here swaps its elements every time `.sort()` is invoked, because the second element always claims it's "less than" the first element, regardless of which order they're in:

    class RelentlesslyTiny:
        def __init__(self, name):
            self.name = name
        def __repr__(self):
            return self.name
        def __lt__(self, other):
            return self is not other

    a = RelentlesslyTiny("A")
    b = RelentlesslyTiny("B")
    xs = [a, b]
    print(xs)
    xs.sort()
    print("after sorting once", xs)
    xs.sort()
    print("after sorting twice", xs)

    [A, B]
    after sorting once [B, A]
    after sorting twice [A, B]

> This statement is certainly false:
>
> * If two items are equal, and pairwise inequality is deterministic, exchanging
> the items does not affect the sorting of other items in the list.

What I said at the start ;-) The only thing .sort() always guarantees regardless of how goofy __lt__ may be is that the result list will be some permutation of the input list. This is so even if __lt__ raises an uncaught exception, killing the sort mid-stream.
Re: [Python-ideas] NAN handling in the statistics module
> This statement is certainly false:
>
> * If two items are equal, and pairwise inequality is deterministic,
> exchanging the items does not affect the sorting of other items in the list.

Just to demonstrate this obviousness:

    >>> sorted([9, 9, 9, b, 1, 2, 3, a])
    [1, 2, 3, A, B, 9, 9, 9]
    >>> sorted([9, 9, 9, a, 1, 2, 3, b])
    [B, 9, 9, 9, A, 1, 2, 3]
    >>> a == b
    True

The classes involved are:

    class A:
        def __lt__(self, other):
            return False
        __gt__ = __lt__
        def __eq__(self, other):
            return True
        def __repr__(self):
            return self.__class__.__name__

    class B(A):
        def __lt__(self, other):
            return True
        __gt__ = __lt__

I do not think these are useful, but __lt__ is deterministic here.
Re: [Python-ideas] NAN handling in the statistics module
On Mon, Jan 7, 2019 at 3:19 PM David Mertz wrote:
> OK, let me be more precise. Obviously if the implementation in a class is:
>
> class Foo:
>     def __lt__(self, other):
>         return random.random() < 0.5
>
> Then we aren't going to rely on much.
>
> * If comparison of any two items in a list (under __lt__) is deterministic,
> is the resulting sort order deterministic? (Pretty sure this is a yes)

If you guarantee that exactly one of "x < y" and "y < x" is true for any given pair of values from the list, and further guarantee that if x < y and y < z then x < z, you have a total order. Without those two guarantees, you could have deterministic comparisons (eg "nan < 5" is always false, but so is "5 < nan"), but there's no way to truly put the elements "in order". Defining __lt__ as "rock < paper", "paper < scissors", "scissors < rock" means that you can't guarantee the sort order, nor determinism.

Are those guarantees safe for your purposes? If so, sort() is, AIUI, guaranteed to behave sanely.

ChrisA
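Chris's rock/paper/scissors point can be made concrete. The class below is illustrative (not from the thread): its __lt__ is deterministic but not transitive, so there is no "true" sorted order for sort() to find:

```python
# A deterministic but non-transitive __lt__: rock < paper < scissors < rock.
# sort() terminates and returns a permutation, but the order it produces
# depends on which comparisons Timsort happens to make.
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}

class Move:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return self.name
    def __lt__(self, other):
        # "x < y" means y beats x -- deterministic, but cyclic.
        return BEATS[other.name] == self.name

r, p, s = Move("rock"), Move("paper"), Move("scissors")
print(sorted([r, p, s]))  # a permutation; the order is not meaningful
print(sorted([s, p, r]))
```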
Re: [Python-ideas] NAN handling in the statistics module
OK, let me be more precise. Obviously if the implementation in a class is:

    class Foo:
        def __lt__(self, other):
            return random.random() < 0.5

Then we aren't going to rely on much.

* If comparison of any two items in a list (under __lt__) is deterministic, is the resulting sort order deterministic? (Pretty sure this is a yes)

* If the pairwise comparisons are deterministic, is sorting idempotent?

This statement is certainly false:

* If two items are equal, and pairwise inequality is deterministic, exchanging the items does not affect the sorting of other items in the list.

On Sun, Jan 6, 2019 at 11:09 PM Tim Peters wrote:
> [David Mertz]
> > Thanks Tim for clarifying. Is it even the case that sorts are STABLE in
> > the face of non-total orderings under __lt__? A couple quick examples
> > don't refute that, but what I tried was not very thorough, nor did I
> > think much about TimSort itself.
>
> I'm not clear on what "stable" could mean in the absence of a total
> ordering. Not only does sort not assume __lt__ is a total ordering,
> it doesn't assume it's transitive, or even deterministic. We really
> can't assume anything about potentially user-defined functions.
>
> What sort does guarantee is that the result list is some permutation
> of the input list, regardless of how insanely __lt__ may behave. If
> __lt__ sanely defines a deterministic total order, then "stable" and
> "sorted" are guaranteed too, with their obvious meanings.
Re: [Python-ideas] NAN handling in the statistics module
[David Mertz]
> Thanks Tim for clarifying. Is it even the case that sorts are STABLE in
> the face of non-total orderings under __lt__? A couple quick examples
> don't refute that, but what I tried was not very thorough, nor did I
> think much about TimSort itself.

I'm not clear on what "stable" could mean in the absence of a total ordering. Not only does sort not assume __lt__ is a total ordering, it doesn't assume it's transitive, or even deterministic. We really can't assume anything about potentially user-defined functions.

What sort does guarantee is that the result list is some permutation of the input list, regardless of how insanely __lt__ may behave. If __lt__ sanely defines a deterministic total order, then "stable" and "sorted" are guaranteed too, with their obvious meanings.
Re: [Python-ideas] NAN handling in the statistics module
[... apologies if this is dup, got a bounce ...]

> [David Mertz]
> > I have to say though that the existing behavior of `statistics.median[_low|_high|]`
> > is SURPRISING if not outright wrong. It is the behavior in existing Python,
> > but it is very strange.
> >
> > The implementation simply does whatever `sorted()` does, which is an
> > implementation detail. In particular, NaN's being neither less than nor
> > greater than any floating point number, just stay where they are during
> > sorting.
>
> I expect you inferred that from staring at a handful of examples, but
> it's illusion. Python's sort uses only __lt__ comparisons, and if
> those don't implement a total ordering then _nothing_ is defined about
> sort's result (beyond that it's some permutation of the original
> list).

Thanks Tim for clarifying. Is it even the case that sorts are STABLE in the face of non-total orderings under __lt__? A couple quick examples don't refute that, but what I tried was not very thorough, nor did I think much about TimSort itself.

> So, certainly, if you want median to be predictable in the presence of
> NaNs, sort's behavior in the presence of NaNs can't be relied on in
> any respect.

Playing with Tim's examples, this suggests that statistics.median() is simply outright WRONG. I can think of absolutely no way to characterize these as reasonable results:

    Python 3.7.1 | packaged by conda-forge | (default, Nov 13 2018, 09:50:42)

    In [4]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4, 5])
    Out[4]: 1

    In [5]: statistics.median([9, 9, 9, nan, 1, 2, 3, 4])
    Out[5]: nan
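The complaint above is easy to reproduce with the five-element examples from the bug report; on current CPython, the answer shifts with the NaN's position (this is implementation-dependent behaviour, not a guarantee):

```python
# median() depends on where the NaN sits, because sorted() leaves NaNs
# wherever the __lt__ outcomes happen to put them (CPython behaviour).
from statistics import median

nan = float('nan')
print(median([nan, 1, 2, 3, 4]))  # 2 on current CPython
print(median([1, 2, 3, 4, nan]))  # 3 on current CPython
```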
Re: [Python-ideas] NAN handling in the statistics module
[David Mertz]
> I have to say though that the existing behavior of
> `statistics.median[_low|_high|]`
> is SURPRISING if not outright wrong. It is the behavior in existing Python,
> but it is very strange.
>
> The implementation simply does whatever `sorted()` does, which is an
> implementation detail. In particular, NaN's being neither less than nor
> greater than any floating point number, just stay where they are during
> sorting.

I expect you inferred that from staring at a handful of examples, but it's illusion. Python's sort uses only __lt__ comparisons, and if those don't implement a total ordering then _nothing_ is defined about sort's result (beyond that it's some permutation of the original list).

There's nothing special about NaNs in this. For example, if you sort a list of sets, then "<" means subset inclusion, which doesn't define a total ordering among sets in general either (unless for every pair of sets in a specific list, one is a proper subset of the other - in which case the list of sets will be sorted in order of increasing cardinality).

> But that's a particular feature of TimSort. Yes, we are guaranteed that sorts
> are stable; and we have rules about which things can and cannot be compared
> for inequality at all. But beyond that, I do not think Python ever promised
> that NaNs would remain in the same positions after sorting

We don't promise it, and it's not true. For example,

    >>> import math
    >>> nan = math.nan
    >>> xs = [0, 1, 2, 4, nan, 5, 3]
    >>> sorted(xs)
    [0, 1, 2, 3, 4, nan, 5]

The NaN happened to move "one place to the right" there. There's no point to analyzing "why" - it's purely an accident deriving from the pattern of __lt__ outcomes the internals happened to invoke. FYI, it goes like so:

    is 1 < 0? No, so the first two are already sorted.
    is 2 < 1? No, so the first three are already sorted.
    is 4 < 2? No, so the first four are already sorted.
    is nan < 4? No, so the first five are already sorted.
    is 5 < nan?
    No, so the first six are already sorted.
    is 3 < 5? Yes!

At that point a binary insertion is used to move 3 into place. And none of timsort's "fancy" parts even come into play for lists so small. The patterns of comparisons the fancy parts invoke can be much more involved. At no point does the algorithm have any idea that there are NaNs in the list - it only looks at boolean __lt__ outcomes.

So, certainly, if you want median to be predictable in the presence of NaNs, sort's behavior in the presence of NaNs can't be relied on in any respect.

    >>> sorted([6, 5, nan, 4, 3, 2, 1])
    [1, 2, 3, 4, 5, 6, nan]
    >>> sorted([9, 9, 9, 9, 9, 9, nan, 1, 2, 3, 4, 5, 6])
    [9, 9, 9, 9, 9, 9, nan, 1, 2, 3, 4, 5, 6]
Re: [Python-ideas] NAN handling in the statistics module
On Sun, Jan 6, 2019 at 4:27 PM Steven D'Aprano wrote:
> I propose adding a "nan_policy" keyword-only parameter to the relevant
> statistics functions (mean, median, variance etc), and defining the
> following policies:
>
> IGNORE: quietly ignore all NANs
> FAIL: raise an exception if any NAN is seen in the data
> PASS: pass NANs through unchanged (the default)
> RETURN: return a NAN if any NAN is seen in the data
> WARN: ignore all NANs but raise a warning if one is seen

I don't think PASS should be the default behavior, and I'm not sure it would be productive to actually implement all of these options.

For reference, NumPy and pandas (the two most popular packages for data analytics in Python) support two of these modes:

- RETURN (numpy.mean() and skipna=False for pandas)
- IGNORE (numpy.nanmean() and skipna=True for pandas)

RETURN is the default behavior for NumPy; IGNORE is the default for pandas. I'm pretty sure RETURN is the right default behavior for Python's standard library and anything else should be considered a bug. It safely propagates NaNs, along the lines of IEEE float behavior.

I'm not sure what the use cases are for PASS, FAIL, or WARN, none of which are supported by NumPy or pandas:

- PASS is a license to return silently incorrect results, in return for very marginal performance benefits. This seems at odds with the intended focus of the statistics module on correctness over speed. Returning incorrect statistics should not be considered a feature that needs to be maintained.

- FAIL would make sense if statistics functions could introduce *new* NaN values. But as far as I can tell, statistics functions already raise StatisticsError in these cases (e.g., if zero data points are provided). If users are concerned about accidentally propagating NaNs, they should be encouraged to check for NaNs at the entry points of their code.

- WARN is even less useful than FAIL. Seriously, who likes warnings?
NumPy uses this approach in array operations that produce NaNs (e.g., when dividing by zero), because *some* but not all results may be valid. But statistics functions return scalars.

I'm not even entirely sure it makes sense to add the IGNORE option, or at least to add it only for NaN. None is also a reasonable sentinel for a missing value in Python, and user defined types (e.g., pandas.NaT) also fall in this category. It seems a little strange to single NaN out in particular.
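Stephan's advice to "check for NaNs at the entry points of their code" could look something like this hedged sketch; require_no_nans is an illustrative name, not a proposed API:

```python
# A small guard callers can apply once, before computing statistics,
# treating any float NaN in the input as an error.
import math

def require_no_nans(data):
    data = list(data)
    if any(isinstance(x, float) and math.isnan(x) for x in data):
        raise ValueError("NaN in input data")
    return data
```

A caller would wrap inputs once, e.g. statistics.mean(require_no_nans(xs)), rather than having every statistics function re-scan the data.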
Re: [Python-ideas] NAN handling in the statistics module
I have to say though that the existing behavior of `statistics.median[_low|_high|]` is SURPRISING if not outright wrong. It is the behavior in existing Python, but it is very strange.

The implementation simply does whatever `sorted()` does, which is an implementation detail. In particular, NaN's being neither less than nor greater than any floating point number, just stay where they are during sorting. But that's a particular feature of TimSort. Yes, we are guaranteed that sorts are stable; and we have rules about which things can and cannot be compared for inequality at all. But beyond that, I do not think Python ever promised that NaNs would remain in the same positions after sorting if some other position was stable under a different sorting algorithm.

So in the incredibly unlikely event I invent a DavidSort that behaves better than TimSort, is stable, and compares only the same Python objects as current CPython, a future version could use this algorithm without breaking promises... even if NaNs sometimes sorted differently than in TimSort. For that matter, some new implementation could use my not-nearly-as-good DavidSort, and while being slower, would still be compliant.

Relying on that for the result of `median()` feels strange to me. It feels strange as the default behavior, but that's the status quo. But it feels even stranger that there are not at least options to deal with NaNs in more of the signaling or poisoning ways that every other numeric library does.

On Sun, Jan 6, 2019 at 7:28 PM Steven D'Aprano wrote:
> Bug #33084 reports that the statistics library calculates median and
> other stats wrongly if the data contains NANs.
> Worse, the result depends
> on the initial placement of the NAN:
>
> py> from statistics import median
> py> NAN = float('nan')
> py> median([NAN, 1, 2, 3, 4])
> 2
> py> median([1, 2, 3, 4, NAN])
> 3
>
> See the bug report for more detail:
>
> https://bugs.python.org/issue33084
>
> The caller can always filter NANs out of their own data, but following
> the lead of some other stats packages, I propose a standard way for the
> statistics module to do so. I hope this will be uncontroversial (he
> says, optimistically...) but just in case, here is some prior art:
>
> (1) Nearly all R stats functions take a "na.rm" argument which defaults
> to False; if True, NA and NAN values will be stripped.
>
> (2) The scipy.stats.ttest_ind function takes a "nan_policy" argument
> which specifies what to do if a NAN is seen in the data.
>
> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
>
> (3) At least some Matlab functions, such as mean(), take an optional
> flag that determines whether to ignore NANs or include them.
>
> https://au.mathworks.com/help/matlab/ref/mean.html#bt5b82t-1-nanflag
>
> I propose adding a "nan_policy" keyword-only parameter to the relevant
> statistics functions (mean, median, variance etc), and defining the
> following policies:
>
> IGNORE: quietly ignore all NANs
> FAIL: raise an exception if any NAN is seen in the data
> PASS: pass NANs through unchanged (the default)
> RETURN: return a NAN if any NAN is seen in the data
> WARN: ignore all NANs but raise a warning if one is seen
>
> PASS is equivalent to saying that you, the caller, have taken full
> responsibility for filtering out NANs and there's no need for the
> function to slow down processing by doing so again. Either that, or you
> want the current implementation-dependent behaviour.
>
> FAIL is equivalent to treating all NANs as "signalling NANs". The
> presence of a NAN is an error.
> RETURN is equivalent to "NAN poisoning" -- the presence of a NAN in a
> calculation causes it to return a NAN, allowing NANs to propagate
> through multiple calculations.
>
> IGNORE and WARN are the same, except IGNORE is silent and WARN raises a
> warning.
>
> Questions:
>
> - does anyone have any serious objections to this?
>
> - what do you think of the names for the policies?
>
> - are there any additional policies that you would like to see?
>   (if so, please give use-cases)
>
> - are you happy with the default?
>
> Bike-shed away!
>
> --
> Steve

-- 
Keeping medicines from the bloodstreams of the sick; food from the bellies
of the hungry; books from the hands of the uneducated; technology from the
underdeveloped; and putting advocates of freedom in prisons. Intellectual
property is to the 21st century what the slave trade was to the 16th.
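[Editor's note: the TimSort observation and the bpo-33084 transcript above boil down to one fact that is easy to demonstrate: every ordering comparison against a NaN is False, so a comparison-based sort gets no signal to move it, and `median()` returns whatever happens to land in the middle slot. A small demonstration, reusing the values from the bug report; the final filtering step is roughly what the proposed IGNORE policy would automate:]

```python
import math
from statistics import median

nan = float("nan")

# Every ordering comparison involving a NaN is False, so a sort
# driven by "<" has no information to move it anywhere.
print(nan < 1, nan > 1, nan == 1)   # False False False

# Consequently the NaN stays put, and the "middle" element shifts:
print(median([nan, 1, 2, 3, 4]))    # 2
print(median([1, 2, 3, 4, nan]))    # 3

# Filtering NaNs out first gives the unambiguous answer:
clean = [x for x in [1, 2, 3, 4, nan] if not math.isnan(x)]
print(median(clean))                # 2.5
```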
Re: [Python-ideas] NAN handling in the statistics module
On Sun, Jan 06, 2019 at 07:46:03PM -0500, David Mertz wrote:
> Would these policies be named as strings or with an enum? Following Pandas,
> we'd probably support both.

Sure, I can support both.

> I won't bikeshed the names, but they seem to
> cover desired behaviors.

Good to hear.

-- 
Steve
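[Editor's note: the "support both" behaviour, pandas-style, could look like the sketch below. `NanPolicy` and `coerce_policy` are hypothetical names for illustration, not anything in the statistics module; the trick is that calling an Enum class looks members up by value, so the string spellings come for free:]

```python
import enum

class NanPolicy(enum.Enum):
    # Hypothetical enum; member names taken from the proposal.
    IGNORE = "ignore"
    FAIL = "fail"
    PASS = "pass"
    RETURN = "return"
    WARN = "warn"

def coerce_policy(policy):
    """Accept either a NanPolicy member or its string value."""
    if isinstance(policy, NanPolicy):
        return policy
    return NanPolicy(policy)  # raises ValueError for unknown strings

print(coerce_policy("ignore") is NanPolicy.IGNORE)      # True
print(coerce_policy(NanPolicy.FAIL) is NanPolicy.FAIL)  # True
```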
Re: [Python-ideas] NAN handling in the statistics module
Would these policies be named as strings or with an enum? Following Pandas,
we'd probably support both. I won't bikeshed the names, but they seem to
cover desired behaviors.

On Sun, Jan 6, 2019, 7:28 PM Steven D'Aprano wrote:
> Bug #33084 reports that the statistics library calculates median and
> other stats wrongly if the data contains NANs.
> [... full proposal quoted in the original post above ...]
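[Editor's note: for concreteness, the five proposed policies are easy to prototype as a wrapper around `statistics.median`. The function name `median_nan` and the lowercase string spellings below are illustrative assumptions only; nothing here is the eventual stdlib API:]

```python
import math
import statistics
import warnings

def _isnan(x):
    return isinstance(x, float) and math.isnan(x)

def median_nan(data, nan_policy="pass"):
    """Sketch of the proposed nan_policy parameter, applied to median."""
    data = list(data)
    if nan_policy in ("fail", "return") and any(map(_isnan, data)):
        if nan_policy == "fail":
            raise ValueError("NAN present in data")
        return float("nan")        # "return": NaN poisoning
    if nan_policy in ("ignore", "warn"):
        cleaned = [x for x in data if not _isnan(x)]
        if nan_policy == "warn" and len(cleaned) != len(data):
            warnings.warn("NAN present in data")
        data = cleaned
    # "pass": hand the data to median unchanged (the current behaviour)
    return statistics.median(data)

nan = float("nan")
print(median_nan([1, 2, 3, 4, nan], nan_policy="ignore"))  # 2.5
print(median_nan([1, 2, 3, 4, nan], nan_policy="return"))  # nan
```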