[issue45832] Misleading membersip expression documentation

2021-11-20 Thread Harald Husum


Harald Husum  added the comment:

Serhiy, thanks for responding.

I agree on this being "third-party" when looking at the original post I made, 
and I've reported the issue to the relevant party 
(https://github.com/pandas-dev/pandas/issues/44504). But really, the bigger 
issue here, in my opinion, is that the equality-hash invariance is poorly 
documented at best, and that this poor documentation is the likely "root cause" 
of the third-party issue I described.

If you read further up the thread, you can see that I have some (pretty 
concrete) proposals for how to improve on the current situation. I'd like the 
documentation maintainers to consider them, in whole or in part. If you feel it 
is more appropriate, I can split it out into a new issue, but I wouldn't mind 
having a conversation about it here.

The Python ecosystem will benefit as a whole if we keep an open mind as to how 
such (fundamental) issues can creep into commonly used, infrastructure-critical 
libraries, developed and maintained by excellent engineers.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45832] Misleading membersip expression documentation

2021-11-20 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

I think there is a bug in either numpy.timedelta64.__hash__ or 
panda.Timedelta.__hash__. Please report it to corresponding libraries.

It is not related to the documentation of membership test. If you have 
different hashes for equal objects most dict and set operations will be broken.

I suggest to close this issue as "third party issue".

--
nosy: +serhiy.storchaka
resolution:  -> third party

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45832] Misleading membersip expression documentation

2021-11-19 Thread Harald Husum


Change by Harald Husum :


--
versions: +Python 3.10 -Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45832] Misleading membersip expression documentation

2021-11-18 Thread Harald Husum


Harald Husum  added the comment:

I am realising that me not knowing about the hash invariance is likely a 
symptom of something I have in common with most Python users, but not with 
Python maintainers: Having access to a powerful ecosystem, we mostly get our 
classes from 3rd parties, rather than implement them ourselves. When I do 
define my own classes, I usually don't have to touch the `__hash__` or `__eq__` 
implementations, since I am either subclassing, making a plain dataclass, or 
leaning on `attrs` to help me out. I think it is telling that even the pandas 
core devs are able to mess this up, and it suggests to me that this invariance 
isn't emphasised enough.

Here's a go at specifying what I mean with a backlink:

"""
For sequence container types such as list, tuple, or collections.deque,
the expression `x in y` is equivalent to `any(x is e or x == e for e in y)`.
For container that use hashing, such as dict, set, or frozenset, 
the same equivalence holds, assuming the [hash 
invariance](https://docs.python.org/3/glossary.html#term-hashable).
"""

I just derived this more or less directly from Hettinger's formulation. It 
could probably be made clearer.

I am realising that this, (famous, it seems), hash invariance isn't defined in 
isolation anywhere, making it slightly hard to link to. Any better suggestions 
than the glossary entry for hashable, which has the definition included? To me, 
it seems that such a fundamental assumption/convention/requirement, that isn't 
automatically enforced, should be as easy as possible to point to.

In my search for the definition (prompted by Hettinger) i discovered more 
surprised, by the way.

Surprise 1:
https://docs.python.org/3/library/collections.abc.html?highlight=hashable#collections.abc.Hashable

> ABC for classes that provide the __hash__() method.

Having now discovered the mentioned invariance, I am surprised this isn't 
explicitly formulated (and implemented? haven't checked) as:

"""
ABC for classes that provide the __hash__() and __eq__() methods.
"""

I also think this docstring deserves a backlink to the invariance definition, 
given it's importance, and how easy it is to shoot yourself in the foot. The 
current formulation of this docstring actually reflected what I (naively) 
assumed it meant to be hashable, suggesting this is the place in the docs I got 
my understanding of the term from.

Surprise 2:
https://docs.python.org/3/reference/expressions.html?highlight=hashable#value-comparisons

> The `hash()` result should be consistent with equality. Objects that are 
> equal should either have the same hash value, or be marked as unhashable.

I appreciate that this is mentioned in this section (I was hoping to find it). 
But it feels like a reiteration of the definition of the invariant, and could 
thus be replaced with a backlink, like suggested above. I'd much rather see the 
text real estate be used for a motivating statement (you do't want weird 
behaviour in sets and dicts), and a reminder of the importance of checking the 
__hash__ implementation if you are modifying the __eq__ implementation, in, 
say, some subclass.

Surprise 3:
https://docs.python.org/3/reference/datamodel.html#object.__eq__

> See the paragraph on __hash__() for some important notes on creating hashable 
> objects which support custom comparison operations and are usable as 
> dictionary keys.

Another case of the invariance being mentioned (I appreciate it), but in a way 
where it isn't directly evident that extreme care should be taken when 
modifying an __eq__ implementation. Perhaps another case where the invariance 
should be referred to by link, and the text should focus on the consequences of 
breaking it.

Surprise 4:
https://docs.python.org/3/reference/datamodel.html#object.__hash__

Another definition-in-passing of the invariance:

> The only required property is that objects which compare equal have the same 
> hash value.

Also replaceable by backlink?

There after follows descriptions of some, (in hindsight very important), 
protection mechanisms.

> User-defined classes have __eq__() and __hash__() methods by default; with 
> them, all objects compare unequal (except with themselves) and x.__hash__() 
> returns an appropriate value such that x == y implies both that x is y and 
> hash(x) == hash(y).

> A class that overrides __eq__() and does not define __hash__() will have its 
> __hash__() implicitly set to None.

But yet again, without some motivating statement for why we care about the 
invariance, all of this seems, well, surprising and weird.

Surprise 5:
https://docs.python.org/3/library/functions.html#hash

Perhaps another location where a backlink would be in order, although not sure 
in this case.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 

[issue45832] Misleading membersip expression documentation

2021-11-17 Thread Harald Husum


Harald Husum  added the comment:

Might i then suggest mentioning in the docs that this assumes the invariance, 
combined with a backlink to the definition, instead of a full repeat?

I'm sure the hash invariant is well known to you guys, working on python, but I 
was genuinely surprised by this behaviour.

Not sure I should be ashamed or not, given that my job has revolved around 
python for the last couple of years.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45832] Misleading membersip expression documentation

2021-11-17 Thread Eric V. Smith

Eric V. Smith  added the comment:

I don’t think repeating the hash invariant in multiple places adds anything, I 
think it would just add clutter. I also think the existing docs are easier to 
understand than the version with the hashing containers split out.

--
nosy: +eric.smith

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45832] Misleading membersip expression documentation

2021-11-17 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

This section presumes that the usual hash invariant holds:  a==b implies 
hash(a)==hash(b).  We could repeat that here but I don't think it makes the 
docs better or more useable to require that docs repeat the same facts in 
multiple places.

Alternatively, the sentence could be split to cover both cases:

"""
For sequence container types such as list, tuple, or collections.deque,
the expression `x in y` is equivalent to `any(x is e or x == e for e in y)`.
For container that use hashing, such as dict, set, or frozenset, 
the expression `x in y` is equivalent to `any(x is e or x == e for e in y if 
hash(x) == hash(e))`.
"""

While that is more precise, it borders on being pedantic and likely doesn't 
make the average reader better off.

Consider submitting a feature request to pandas suggesting that they harmonize 
their hash functions with their counterparts in numpy.

--
nosy: +rhettinger

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45832] Misleading membersip expression documentation

2021-11-17 Thread Harald Husum


New submission from Harald Husum :

https://docs.python.org/3/reference/expressions.html#membership-test-operations

> For container types such as list, tuple, set, frozenset, dict, or 
> collections.deque, the expression `x in y` is equivalent to `any(x is e or x 
> == e for e in y)`.

Yet:

```py
import pandas as pd
import numpy as np

pd_0_dt = pd.Timedelta(0)
np_0_dt = np.timedelta64(0)

cm = (pd_0_dt, np_0_dt)

d1 = {np_0_dt: pd_0_dt}
d2 = {pd_0_dt: np_0_dt}

def test_membership_doc_claim(candidate_members, dct):
for m in candidate_members:
if m in dct:
assert any(m is e or m == e for e in dct)

if any(m is e or m == e for e in dct):
assert m in dct

if __name__ == "__main__":
test_membership_doc_claim(cm, d1) # Fails
test_membership_doc_claim(cm, d2) # Fails

```

Not too surprised, given the td.__hash__() implementation differs between these 
classes, but they are considered equal none the less.

Unsure whether it is the dict implementation or the doc claim that needs to 
budge here.

--
assignee: docs@python
components: Documentation
messages: 406485
nosy: docs@python, eric.araujo, ezio.melotti, harahu, mdk, willingc
priority: normal
severity: normal
status: open
title: Misleading membersip expression documentation
type: behavior
versions: Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com