[issue43475] Worst-case behaviour of hash collision with float NaN

2021-11-24 Thread Christoph Groth


Christoph Groth  added the comment:

> What concrete action would you propose that the Python core devs take at this 
> point?

Nothing for now.

I stumbled across this issue through 
https://gitlab.kwant-project.org/kwant/tinyarray/-/issues/20 and had the 
impression that the aspect that I raised (that, for example, hash values of 
immutable built-in objects now no longer survive pickling) was not examined in 
this otherwise in-depth discussion.  So I added it for reference.

If problems come up that are caused by this change, I would consider reverting 
it a possible solution.

> The result of the change linked to this PR is that the hash now also reflects 
> that containment depends on object identity, not just object value.

This is a nice way to phrase it.  Thanks for the link to the entertaining talk. 
:-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-11-24 Thread Mark Dickinson


Mark Dickinson  added the comment:

Just for fun: I gave a somewhat ranty 10-minute talk on this topic at a 
(virtual) conference a few months ago: 
https://www.youtube.com/watch?v=01oeosRVwgY

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-11-24 Thread Mark Dickinson


Mark Dickinson  added the comment:

@cwg: Yep, we're aware of this. There are no good solutions here - only a mass 
of constraints, compromises and trade-offs. I think we're already somewhere on 
the Pareto boundary of the "best we can do" given the constraints. Moving to 
another point on the boundary doesn't seem worth the code churn.

What concrete action would you propose that the Python core devs take at this 
point?

> it was possible to convert a tuple of floats into a numpy array and back into 
> a tuple, and the hash values of both tuples would be equal.  This is no 
> longer the case.

Sure, but the problem isn't really with hash; that's just a detail. It lies 
deeper than that - it's with containment itself:

>>> import numpy as np
>>> import math
>>> x = math.nan
>>> some_list = [1.5, 2.3, x]
>>> x in some_list
True
>>> x in list(np.array(some_list))  # expect True, get False
False

The result of the change linked to this PR is that the hash now also reflects 
that containment depends on object identity, not just object value. Reverting 
the change would solve the superficial hash problem, but not the underlying 
containment problem, and would re-introduce the performance issue that was 
fixed here.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-11-24 Thread Christoph Groth


Christoph Groth  added the comment:

Hello.  I would like to point out a possible problem that this change to 
CPython has introduced.

This change looks innocent, but it breaks the unwritten rule that the hash 
value of a number (or actually any built-in immutable type!) in Python depends 
only on its value.

Thus, before this change, it was possible to convert a tuple of floats into a 
numpy array and back into a tuple, and the hash values of both tuples would be 
equal.  This is no longer the case.

Or, more generally, any hashable tuple could be pickled and unpickled, without 
changing its hash value.  I could well imagine that this breaks real code in 
subtle ways.

Likewise, it is now no longer possible to provide a library of sequences of 
numbers that always hashes like built-in tuples.  (As "tinyarray", of which I 
am the author, did.)

--
nosy: +cwg

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-16 Thread Mark Dickinson


Mark Dickinson  added the comment:

I think this can be closed now.

--
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-15 Thread Mark Dickinson


Mark Dickinson  added the comment:


New changeset 8d0b2ca493e236fcad8709a622c1ac8ad29c372d by Mark Dickinson in 
branch '3.10':
[3.10] bpo-43475: Add what's new entry for NaN hash changes (GH-26725) 
(GH-26743)
https://github.com/python/cpython/commit/8d0b2ca493e236fcad8709a622c1ac8ad29c372d


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-15 Thread Mark Dickinson


Change by Mark Dickinson :


--
pull_requests: +25328
pull_request: https://github.com/python/cpython/pull/26743

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-15 Thread Mark Dickinson


Mark Dickinson  added the comment:


New changeset 1d10bf0bb9409a406c56b0de8870df998637fd0f by Mark Dickinson in 
branch 'main':
bpo-43475: Add what's new entry for NaN hash changes (GH-26725)
https://github.com/python/cpython/commit/1d10bf0bb9409a406c56b0de8870df998637fd0f


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-14 Thread Mark Dickinson


Change by Mark Dickinson :


--
pull_requests: +25315
pull_request: https://github.com/python/cpython/pull/26725

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-14 Thread Mark Dickinson


Mark Dickinson  added the comment:

> Does this change need to be mentioned in What's New?

Yes, I think so, given that it's a change to documented behavior. It's also 
something that third party packages (like NumPy) potentially need to be aware 
of.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> If one wants to have all NaNs in one equivalency class
> (e.g. if used as a key-value for example in pandas) it
> is almost impossible to do so in a consistent way 
> without taking a performance hit.

ISTM the performance of the equivalent class case is far less important than 
the one we were trying to solve.  Given a choice we should prefer helping 
normal unadorned instances rather than giving preference to a subclass that 
redefines the usual behaviors.  

In CPython, it is a fact of life that overriding builtin behaviors with pure 
python code always incurs a performance hit.  Also, in your example, the 
subclass isn't technically correct because it relies on a non-guaranteed 
implementation details.  It likely isn't even the fastest approach.

The only guaranteed behaviors are that math.isnan(x) reliably detects a NaN and 
that x!=x when x is a NaN.  Those are the only assured tools in the uphill 
battle to fight the weird intrinsic nature of NaNs.

So one possible solution is to replace all the NaNs with a canonical 
placeholder value that doesn't have undesired properties:

{None if isnan(x) else x for x in arr}

That relies on guaranteed behaviors and is reasonably fast.  IMO that beats 
trying to reprogram float('NaN') to behave the opposite of how it was designed.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread realead


realead  added the comment:

@mark.dickinson

> ...my expectation was that there would be few cases of breakage, and that for 
> those few cases it shouldn't be difficult to fix the breakage.

This expectation is probably correct.

My issue is somewhat only partly on-topic here: If one wants to have all NaNs 
in one equivalency class (e.g. if used as a key-value for example in pandas) it 
is almost impossible to do so in a consistent way without taking a performance 
hit. It seems to be possible builtin-types (even if frozenset won't be pretty), 
but already something like


class A:
def __init__(self, a):
self.a=a
def __hash__(self):
return hash(self.a)
def __eq__(self, other):
return self.a == other.a

is not easy to handle.

A special comparator for containers would be an ultimative solution, but I see 
how this could be too much hassle for a corner case.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Does this change need to be mentioned in What's New?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread miss-islington


miss-islington  added the comment:


New changeset 128899d8b8d65d86bd9bbea6801e9f36e6f409f2 by Miss Islington (bot) 
in branch '3.10':
bpo-43475: Fix the Python implementation of hash of Decimal NaN (GH-26679)
https://github.com/python/cpython/commit/128899d8b8d65d86bd9bbea6801e9f36e6f409f2


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

> Maybe a new comparator is called for (like PY_EQ_FOR_HASH_COLLECTION), which 
> would yield float("nan") equivalent to float("nan") and which should be used 
> in hash-set/dict?

It is difficult to do from technical point of view because all rich comparison 
implementations support currently only 6 operations. If you call tp_richcomp 
with something else it will crash. So we need to add new slot tp_richcomp_ex 
and complex logic to call either tp_richcomp_ex or tp_richcomp and correctly 
handle inheritance. It is too high bar.

> Does GH-26679 need to be backported to the 3.10 branch?

I thought that the original change was 3.11 only, but seems it was merged 
before the split of the 3.10 branch. So PR 26679 needs to be backported. I 
would add a NEWS entry if I knew this.

--
versions: +Python 3.10

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread miss-islington


Change by miss-islington :


--
nosy: +miss-islington
nosy_count: 6.0 -> 7.0
pull_requests: +25292
pull_request: https://github.com/python/cpython/pull/26706

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread Mark Dickinson


Mark Dickinson  added the comment:

@Serhiy: can this be closed again? Does GH-26679 need to be backported to the 
3.10 branch?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-13 Thread Mark Dickinson


Mark Dickinson  added the comment:

@realead

> This change makes life harder for people trying to get sane behavior with 
> sets [...]

Yes, code that assumes that all NaNs have the same hash value will need to 
change. But that doesn't seem unreasonable for objects that are already 
considered distinct with respect to the containment equivalence relation ("is" 
followed by "=="). I think this change was the right one to make, and my 
expectation was that there would be few cases of breakage, and that for those 
few cases it shouldn't be difficult to fix the breakage. If that's not true 
(either there are lots of cases of breakage, or the breakage is hard to fix), 
perhaps we should reconsider. I haven't yet seen evidence that either of those 
things is true, though.

> Maybe a new comparator is called for [...]

Agreed that that seems like a possible technical solution: Python could add a 
new special method __container_eq__ which is required to provide (at the least) 
a reflexive and symmetric relation, and whose default implementation does the 
same is-followed-by-== check that's already used for list, dict and set 
containment. For what it's worth, this is close to the approach that Julia 
takes: they have an "is_equal" function that's mostly the same as the normal 
equality == but differs for NaNs (and incidentally for signed zeros). But 
whether this would make sense from a language design perspective is another 
matter, and it would be a significant language-level change (certainly 
requiring a PEP). It feels like an enormous conceptual change to introduce to 
the language just to deal with the annoyance of NaNs. And that really is all it 
would be good for, unless perhaps it also provides a solution to the problem of 
NumPy arrays in containers, again caused by NumPy arrays overriding __eq__ in a
  nonstandard way.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-12 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:


New changeset 9f1c5f6e8af6ba3f659b2aea1e221ac9695828ba by Serhiy Storchaka in 
branch 'main':
bpo-43475: Fix the Python implementation of hash of Decimal NaN (GH-26679)
https://github.com/python/cpython/commit/9f1c5f6e8af6ba3f659b2aea1e221ac9695828ba


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-11 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
pull_requests: +25265
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/26679

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-11 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

There is an error in the Python implementation of Decimal.__hash__. It calls 
super().__hash__(), but the C implementation calls object.__hash__().

Also, the documentation for floating point hash has the same error.

--
stage: resolved -> 
status: closed -> open
versions: +Python 3.11

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-06-11 Thread realead


realead  added the comment:

This change makes life harder for people trying to get sane behavior with sets 
for float-, complex- and tuple-objects containing NaNs. With "sane" meaning, 
that 

set([nan, nan, nan]) => {nan}

Until now, one has only to override the equal-comparison for these types but 
now also hash-function needs to be overriden (which is more complicated e.g. 
for tuple).



On a more abstract level: there are multiple possible equivalence relations. 

The one used right now for floats is Py_EQ and results in 
float("nan)!=float("nan") due to IEEE 754.

In hash-set/dict we would like to use an equivalence relation for which all 
NaNs would be in the same equivalence class. Maybe a new comparator is called 
for (like PY_EQ_FOR_HASH_COLLECTION), which would yield float("nan") equivalent 
to float("nan") and which should be used in hash-set/dict?

--
nosy: +realead

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-22 Thread Mark Dickinson


Mark Dickinson  added the comment:

Opened https://github.com/numpy/numpy/issues/18833

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-22 Thread Mark Dickinson


Mark Dickinson  added the comment:

Thanks, Raymond. I'll open something on the NumPy bug tracker, too, since they 
may want to consider doing something similar for numpy.float64 and friends.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-22 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-22 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset a07da09ad5bd7d234ccd084a3a0933c290d1b592 by Raymond Hettinger in 
branch 'master':
bpo-43475:  Fix worst case collision behavior for NaN instances (GH-25493)
https://github.com/python/cpython/commit/a07da09ad5bd7d234ccd084a3a0933c290d1b592


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-20 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
keywords: +patch
pull_requests: +24216
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/25493

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-10 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> I'm loathe to guarantee anything about this in the language itself. 

There aren't language any guarantees being proposed.  Letting the hash depend 
on the object id just helps avoid quadratic behavior.  Making float('NaN') a 
singleton is also perfectly reasonable behavior for an immutable type.  Neither 
is a guaranteed behavior, just a convenient one.


> I'm not convinced it addresses a real-world problem

Maybe yes, maybe no.  I would hope that NaNs arising from bogus calculations 
would be rare.  OTOH, they are commonly used for missing values in Pandas where 
internal dict/set operations abound.  Either way, I would like to close off a 
trivially easy way to invoke quadratic behavior unexpectedly.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-10 Thread Tim Peters


Tim Peters  added the comment:

I agree hashing a NaN acting like the generic object hash (return rotated 
address) is a fine workaround, although I'm not convinced it addresses a 
real-world problem ;-) But why not? It might.

But that's for CPython. I'm loathe to guarantee anything about this in the 
language itself. If an implementation wants `__contains__()` tests to treat all 
NaNs as equal instead, or consider them equal only if "is" holds, or never 
considers them equal, or resolves equality as bitwise representation equality 
... all are fine by me. There's no truly compelling case to made for any of 
them, although "never considers them equal" is least "surprising" given the 
insane requirement that NaNs never compare equal under IEEE rules, and that 
Python-the-language doesn't formally support different notions of equality in 
different contexts.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-10 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
assignee:  -> rhettinger

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-10 Thread Mark Dickinson


Mark Dickinson  added the comment:

> Why not have hash() return the id() like we do for instances of object?

I think that's fine, and IIUC that's what Cong Ma was proposing. It seems like 
the least invasive potential fix.

In principle I find the idea of making NaN a singleton rather attractive - the 
performance hit is likely negligible, and it solves a bunch of subtle 
NaN-related issues. (For example, there's still a proposal to provide an IEEE 
754 total_ordering key so that lists of floats can be ordered sanely in the 
presence of nans, but even if you use that you don't know whether your nans 
will end up at the start or the end of the list, and you could potentially have 
nans in both places; fixing a single nan and its sign bit would fix that.) But 
there are bound to be some people somewhere making use of the NaN payload and 
sign bit, however inadvisedly, and Serhiy's right that this couldn't be 
extended to Decimal, where the sign bit and payload of the NaN are mandated by 
the standard.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-04-10 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Mark, is there any reason hash(float('NaN')) and hash(Decimal('NaN')) have to 
be a constant?  Since NaNs don't compare equal, the hash homomorphism has no 
restrictions.  Why not have hash() return the id() like we do for instances of 
object?

I understand that sys.hash_info.nan would be invalidated, but that was likely 
useless anyway.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-15 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> And making float('nan') returning a singleton, 
> but 1e1000 * 0 returning different NaN would cause large confusion.

Not really, it would be just be an implementation detail, no different than int 
and strings being sometimes unique and sometimes not.  Historically, immutable 
objects are allowed to be reused when it is convenient for the implementation.

> What about Decimal NaN?

Decimal isn't used as much, so the need is less pressing, but we can do 
whatever is allowed by the spec.  Presumably, that would be easier than with 
floats because we control all possible ways to construct Decimals.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-15 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

What about Decimal NaN? Even if make float NaN a singleton, there will be other 
NaNs.

And making float('nan') returning a singleton, but 1e1000 * 0 returning 
different NaN would cause large confusion.

--
nosy: +serhiy.storchaka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-14 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> The performance thoughts were motivated by the idea of
> making NaN a singleton: adding a check to 
> PyFloat_FromDouble would mean that almost every operation
> that produced a float would have to pass through that check.

It may suffice to move the check upstream from PyFloat_FromDouble so that 
float('NaN') alway produces identically the same object as math.nan.

That would handle the common cases where NaN is used for missing values or is 
generated from string conversions.  We don't need a bullet-proof solution, just 
mitigation of harm.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-13 Thread Mark Dickinson


Mark Dickinson  added the comment:

@Cong Ma: Yes, I'm not worried about performance for the change you're 
proposing (though as you say, we should benchmark anyway). The performance 
thoughts were motivated by the idea of making NaN a singleton: adding a check 
to PyFloat_FromDouble would mean that almost every operation that produced a 
float would have to pass through that check.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-13 Thread Cong Ma


Cong Ma  added the comment:

> Idea:  We could make this problem go away by making NaN a singleton.

Apart from the fact that NaN is not a specific value/object (as pointed out in 
the previous message by @mark.dickinson), currently each builtin singleton 
(None, True, False, etc.) in Python satisfies the following predicate:

`if s is a singleton, then s == s`

This is also satisfied by "flyweight" objects such as small ints.

It feels natural and unsurprising to have this, if not officially promised. 
Requiring NaN to be a singleton would violate this implicit understanding about 
singletons, and cause another surprise on top of the other surprises with NaNs 
(or with rich comparison in general).

Performance-wise, I think the current behaviour (returning _PyHASH_NAN == 0) is 
already nested inside two layers of conditionals in `_Py_HashDouble()` in 
Python/pyhash.c:

```
if (!Py_IS_FINITE(v)) {
if (Py_IS_INFINITY(v))
return v > 0 ? _PyHASH_INF : -_PyHASH_INF;
else
return _PyHASH_NAN;
}
```
(v is the underlying C `double`, field `ob_fval` of `PyFloatObject`).

I don't feel performance would be hurt by rewriting `float_hash()` in 
Objects/floatobject.c to the effect of

```
if (!Py_IS_NAN(v->ob_fval)) {
return _Py_HashDouble(v->ob_fval)
}
else {
return _Py_HashPointer(v);
}
```
and simplify the conditionals in `_Py_HashDouble()`. But of course, only real 
benchmarking could tell. (Also, other float-like types would have to be 
adjusted, too).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-13 Thread Mark Dickinson


Mark Dickinson  added the comment:

> signalling NaNs are quietly silenced on my machine

I just tested this assertion, and it appears that I was lying: this doesn't 
seem to be true. I'm almost *sure* it used to be true, and I'm not sure what 
changed, and when. (FWIW: Python 3.9.2 from macPorts on macOS 10.14.6 + Intel 
hardware. Earlier versions of Python on the same machine give similar results 
to those below, so it's not a core Python change.)

Here's an attempt to create an snan Python float:

Python 3.9.2 (default, Feb 28 2021, 13:47:30) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import struct
>>> snan_bits = 0x7ff000123456
>>> snan = struct.unpack(">> snan
nan

Converting back shows that the attempt was successful: the payload, including 
the signalling bit (bit 51, counting with the lsb as bit 0), was preserved:

>>> hex(struct.unpack(">> hex(struct.unpack("

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-13 Thread Mark Dickinson


Mark Dickinson  added the comment:

> We could make this problem go away by making NaN a singleton.

That could work, though we'd annoy anyone who cared about preserving NaN 
payloads and signs in Python. I don't know if such people exist. Currently the 
sign is accessible through things like math.copysign, and the payload through 
struct manipulations (though on most platforms we still don't see the full 
range of NaNs: signalling NaNs are quietly silenced on my machine).

We'd also want to do some performance checks: the obvious way to do this would 
be to have an "is_nan" check in PyFloat_FromDouble. I'd *guess* that a typical 
CPU would do a reasonable job of branch prediction on that check so that it had 
minimal impact in normal use, but only timings will tell for sure.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-12 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Idea:  We could make this problem go away by making NaN a singleton.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Mark Dickinson


Mark Dickinson  added the comment:

> I also wonder if there's security implication for servers that process 
> user-submitted input

Yes, the "malicious actor" scenario is another one to consider. But unlike the 
string hashing attack, I'm not seeing a realistic way for the nan hash 
collisions to be used in attacks, and I'm content not to worry about that until 
someone gives an actual proof of concept. Many of Python's hash functions are 
fairly predictable (by design!) and there are already lots of other ways to 
deliberately construct lots of hash collisions with non-string non-float values.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Cong Ma


Cong Ma  added the comment:

Sorry, please ignore my rambling about "float() returning aliased object" -- in 
that case the problem with hashing doesn't arise.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Cong Ma


Cong Ma  added the comment:

Thank you @mark.dickinson for the detailed analysis.

In addition to your hypothetical usage examples, I am also trying to understand 
the implications for user code.

If judging by the issues people open on GitHub like this: 
https://github.com/pandas-dev/pandas/issues/28363 yes apparently people do run 
into the "NaN as key" problem every now and then, I guess. (I'm not familiar 
enough with the poster's real problem in the Pandas setting to comment about 
that one, but it seems to suggest people do run into "real world" problems that 
shares some common features with this one). There's also StackOverflow threads 
like this (https://stackoverflow.com/a/17062985) where people discuss hashing a 
data table that explicitly use NaN for missing values. The reason seems to be 
that "[e]mpirical evidence suggests that hashing is an effective strategy for 
dimensionality reduction and practical nonparametric estimation." 
(https://arxiv.org/pdf/0902.2206.pdf).

I cannot imagine whether the proposed change would make life easier for people 
who really want NaN keys to begin with. However I think it might reduce the 
exposure to worst-case performances in dict (and maybe set/frozenset), while 
not hurting Python's own self-consistency.

Maybe there's one thing to consider about future compatibility though... 
because the proposed fix depends on the current behaviour that floats (and by 
extension, built-in float-like objects such as Decimal and complex) are not 
cached, unlike small ints and interned Unicode objects. So when you explicitly 
construct a NaN by calling float(), you always get a new Python object. Is this 
guaranteed now on different platforms with different compilers? I'm trying to 
parse the magic in Include/pymath.h about the definition of macro `Py_NAN`, 
where it seems to me that for certain configs it could be a `static const 
union` translation-unit-level constant. Does this affect the code that actually 
builds a Python object from an underlying NaN? (I apologize if this is a bad 
question). But if it doesn't and we're guaranteed to really have Python 
NaN-objects that don't alias if explicitly constructed, is this something 
unlikely to change in the future?

I also wonder if there's security implication for servers that process 
user-submitted input, maybe running a float() constructor on some string list, 
checking exceptions but silently succeeding with "nan": arguably this is not 
going to be as common an concern as the string-key collision DoS now foiled by 
hash randomization, but I'm not entirely sure.

On "theoretical interest".. yes theoretical interests can also be "real world" 
if one teaches CS theory in real world using Python, see 
https://bugs.python.org/issue43198#msg386849

So yes, I admit we're in an obscure corner of Python here. It's a tricky, 
maybe-theoretical, but seemingly annoying but easy-to-fix issue..

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Mark Dickinson


Mark Dickinson  added the comment:

On third thoughts, of course it *would* help, because the Counter is keeping 
references to the realised NaN values. I think I'll go away now and come back 
when my brain is working.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Mark Dickinson


Mark Dickinson  added the comment:

Hmm. On second thoughts, the proposed solution wouldn't actually *help* with 
the situation I gave: the elements (lazily) realised from the NumPy array are 
highly likely to all end up with the same address in RAM. :-(

>>> x = np.full(10, np.nan)
>>> for v in x:
... print(id(v))
... del v
... 
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Mark Dickinson


Mark Dickinson  added the comment:

Sigh. When I'm undisputed ruler of the multiverse, I'm going to make "NaN == 
NaN" return True, IEEE 754 be damned. NaN != NaN is fine(ish) at the level of 
numerics; the problems start when the consequences of that choice leak into the 
other parts of the language that care about equality. NaNs just shouldn't be 
considered "special enough to break the rules" (where the rules here are that 
the equivalence relation being used as the basis of equality for a general 
container should actually *be* an equivalence relation - reflexive, symmetric, 
and transitive).

Anyway, more constructively ...

I agree with the analysis, and the proposed solution seems sound: if we're 
going to change this, then using the object hash seems like a workable 
solution. The question is whether we actually do need to change this.

I'm not too worried about backwards compatibility here: if we change this, 
we're bound to break *someone's* code somewhere, but it's hard to imagine that 
there's *that* much code out there that makes useful use of the property that 
hash(nan) == 0. The most likely source of breakage I can think of is in test 
code that checks that 3rd-party float-like things (NumPy's float64, or gmpy2 
floats, or ...) behave like Python's floats.

@Cong Ma: do you have any examples of cases where this has proved, or is likely 
to prove, a problem in real-world code, or is this purely theoretical at this 
point?

I'm finding it hard to imagine cases where a developer *intentionally* has a 
dictionary with several NaN values as keys. (Even a single NaN as a key seems a 
stretch; general floats as dictionary keys are rare in real-world code that 
I've encountered.) But it's somewhat easier to imagine situations where one 
could run into this accidentally. Here's one such:

>>> import collections, numpy as np
>>> x = np.full(10, np.nan)
>>> c = collections.Counter(x)

That took around 4 minutes of non-ctrl-C-interruptible time on my laptop. (I 
was actually expecting it not to "work" as a bad example: I was expecting that 
NaNs realised from a NumPy array would all be the same NaN object, but that's 
not what happens.) For comparison, this appears instant:

>>> x = np.random.rand(10)
>>> c = collections.Counter(x)

And it's not a stretch to imagine having a NumPy array representing a masked 
binary 1024x1024-pixel image (float values of NaN, 0.0 and 1.0) and using 
Counter to find out how many 0s and 1s there were.

On the other hand, we've lived with the current behaviour for some decades now 
without it apparently ever being a real issue.

On balance, I'm +1 for fixing this. It seems like a failure mode that would be 
rarely encountered, but it's quite unpleasant when it *is* encountered.

--
nosy: +rhettinger, tim.peters

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Mark Dickinson


Change by Mark Dickinson :


--
nosy: +mark.dickinson

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43475] Worst-case behaviour of hash collision with float NaN

2021-03-11 Thread Cong Ma


New submission from Cong Ma :

Summary: CPython hash all NaN values to 0. This guarantees worst-case behaviour 
for dict if numerous existing keys are NaN. I think by hashing NaN using the 
generic object (or "pointer") hash instead, the worst-case situation can be 
alleviated without changing the semantics of either dict or float. However, 
this also requires changes to how complex and Decimal objects hash, and 
moreover incompatible change to sys.hash_info. I would like to hear how Python 
developers think about this matter.



Currently CPython uses the hard-coded macro constant 0 (_PyHASH_NAN, defined in 
Include/pyhash.h) for the hash value of all floating point NaNs. The value is 
part of the sys.hashinfo API and is re-used by complex and Decimal in computing 
its hash in accordance with Python builtin-type documentation. [0]

(The doc [0] specifically says that "[a]ll hashable nans have the same hash 
value.")

This is normally not a great concern, except for the worst case performance. 
The problem is that, since they hash to the same value and they're guaranteed 
to compare unequal to any compatible numeric value -- not even to themselves, 
this means they're guaranteed to collide.

For this reason I'd like to question whether it is a good idea to have all 
hashable NaNs with the same hash value.

There has been some discussions about this over the Web for some time (see 
[1]). In [1] the demo Python script times the insertion of k distinct NaN keys 
(as different objects) into a freshly created dict. Since the keys are distinct 
and are guaranteed to collide with each other (if any), the run time of a 
single lookup/insertion is roughly linear to the existing number of NaN keys. 
I've recreated the same script using with a more modern Python (attached).

I'd suggest a fix for this worst-case behaviour: instead of returning the hash 
value 0 for all NaNs, use the generic object (pointer) hash for these objects. 
As a PoC (also in the attached script), it roughly means

```
class myfloat(float):
def __hash__(self):
if self != self:  # i.e., math.isnan(self)
return object.__hash__(self)
return super().__hash__(self)
```

This will
- keep the current behaviour of dict intact;
- keep the invariant `a == b implies hash(a) == hash(b)` intact, where 
applicable;
- uphold all the other rules for Python numeric objects listed in [0];
- make hash collisions no more likely than with object() instances (dict lookup 
time is amortized constant w.r.t. existing number of NaN keys).

However, it will
- not keep the current rule "All hashable nans have the same hash value";
- not be compatible with the current sys.hash_info API (requires the removal of 
the "nan" attribute from there and documenting the change);
- require congruent modifications in complex and Decimal too.

Additionally, I don't think this will affect module-level NaN "constants" such 
as math.nan and how they behave. The "NaN constant" has never been a problem to 
begin with. It's only the *distinct* NaN objects that may cause the worst-case 
behaviour.



Just for the record I'd also like to clear some outdated info or misconception 
about NaN keys in Python dicts. It's not true that NaN keys, once inserted, 
cannot be retrieved (e.g., as claimed in [1][2]). In Python, they can be, if 
you keep the original key *object* around by keeping a reference to it (or 
obtaining a new one from the dict by iterating over it). This, I think, is 
because Python dict compares for object identity before rich-comparing for 
equality in `lookdict()` in Objects/dictobject.c, so this works for `d = 
dict()`:

```
f = float("nan")
d[f] = "value"
v = d[f]
```

but this fails with `KeyError`, as it should:

```
d[float("nan")] = "value"
v = d[float("nan")]
```

In this regard the NaN float object behaves exactly like the object() instance 
as keys -- except for the collisions. That's why I think at least for floats 
the "object" hash is likely to work. The solution using PRNG [1] (implemented 
with the Go language) is not necessary for CPython because the live objects are 
already distinct.



Links:

[0] https://docs.python.org/3/library/stdtypes.html#hashing-of-numeric-types
[1] https://research.swtch.com/randhash
[2] 
https://readafterwrite.wordpress.com/2017/03/23/how-to-hash-floating-point-numbers/

--
components: Interpreter Core, Library (Lib)
files: nan_key.py
messages: 388508
nosy: congma
priority: normal
severity: normal
status: open
title: Worst-case behaviour of hash collision with float NaN
type: performance
Added file: https://bugs.python.org/file49869/nan_key.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com