[issue34751] Hash collisions for tuples

2018-11-05 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: Many thanks! -- ___ Python tracker ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue34751] Hash collisions for tuples

2018-10-27 Thread Raymond Hettinger
Change by Raymond Hettinger : -- resolution: -> fixed stage: patch review -> resolved status: open -> closed ___ Python tracker ___

[issue34751] Hash collisions for tuples

2018-10-27 Thread Raymond Hettinger
Raymond Hettinger added the comment: New changeset aeb1be5868623c9cd9cf6d7de3015a43fb005815 by Raymond Hettinger (jdemeyer) in branch 'master': bpo-34751: improved hash function for tuples (GH-9471) https://github.com/python/cpython/commit/aeb1be5868623c9cd9cf6d7de3015a43fb005815

[issue34751] Hash collisions for tuples

2018-10-10 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > it will typically change only the last two bits of the final result Which is great if all that you care about is avoiding collisions. -- ___ Python tracker

[issue34751] Hash collisions for tuples

2018-10-09 Thread Tim Peters
Tim Peters added the comment: >> Changes initialization to add in the length: > What's the rationale for that change? You always asked > me to stay as close as possible to the "official" hash function > which adds in the length at the end. Is there an actual benefit > from doing it in the

[issue34751] Hash collisions for tuples

2018-10-09 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: I pushed an update at PR 9471. I think I took into account all your comments, except for moving the length addition from the end to the begin of the function. -- ___ Python tracker

[issue34751] Hash collisions for tuples

2018-10-09 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > Changes initialization to add in the length: What's the rationale for that change? You always asked me to stay as close as possible to the "official" hash function which adds in the length at the end. Is there an actual benefit from doing it in the

[issue34751] Hash collisions for tuples

2018-10-08 Thread Tim Peters
Tim Peters added the comment: We need to worry about timing too :-( I'm using this as a test. It's very heavy on using 3-tuples of little ints as dict keys. Getting hash codes for ints is relatively cheap, and there's not much else going on, so this is intended to be very sensitive to

[issue34751] Hash collisions for tuples

2018-10-07 Thread Tim Peters
Tim Peters added the comment: Here's htest.py output from git pull https://github.com/jdemeyer/cpython.git bpo34751 and a variation. The number of collisions in the variation appear in square brackets at the end of each line. 32-bit build: range(100) by 3; 32-bit hash codes; mean 116.42

[issue34751] Hash collisions for tuples

2018-10-07 Thread Tim Peters
Tim Peters added the comment: Attaching htest.py so we have a common way to compare what various things do on the tests that have been suggested. unittest sucks for that. doctest too. Here's current code output from a 32-bit build; "ideally" we want "got" values not much larger than

[issue34751] Hash collisions for tuples

2018-10-07 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: I updated PR 9471 with a tuple hash function based on xxHash. The only change w.r.t. the official xxHash specification is that I'm not using parallellism and just using 1 accumulator. Please have a look. -- ___

[issue34751] Hash collisions for tuples

2018-10-06 Thread Tim Peters
Tim Peters added the comment: Thinking about the way xxHash works prompted me to try this initialization change: x = 0x345678UL + (Py_uhash_t)len; // adding in the length is new That was a pure win in my tests, slashing the number of collisions in the new tuple test, 32-bit build, from

[issue34751] Hash collisions for tuples

2018-10-06 Thread Tim Peters
Tim Peters added the comment: As a sanity check, I tried the 32-bit version of MurmurHash2 (version 3 doesn't have a 32-bit version). All of my Python tests had collisions, and while the new tuple test dropped to 15, product([0.5, 0.25], repeat=20) skyrocketed from 141 (32-bit xxHash) to

[issue34751] Hash collisions for tuples

2018-10-04 Thread Tim Peters
Tim Peters added the comment: >> people already wrote substantial test suites dedicated >> to that sole purpose, and we should aim to be "mere >> consumers" of functions that pass _those_ tests. > There are hash functions that pass those tests which > are still bad in practice when used as

[issue34751] Hash collisions for tuples

2018-10-04 Thread Tim Peters
Change by Tim Peters : -- components: +Interpreter Core -XML ___ Python tracker ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue34751] Hash collisions for tuples

2018-10-04 Thread Tim Peters
Tim Peters added the comment: >> In the 64-bit build there are no collisions across my >> tests except for 11 in the new tuple test. > That's pretty bad actually. With 64 bits, you statistically > expect something in the order of 10**-8 collisions. So > what you're seeing is 9 orders of

[issue34751] Hash collisions for tuples

2018-10-04 Thread Tim Peters
Tim Peters added the comment: > Note that I'm always considering parametrized > versions of the hash functions that I'm testing. > I'm replacing the fixed multiplier (all algorithms > mentioned here have such a thing) by a random > multiplier which is 3 mod 8. I've explained before in some

[issue34751] Hash collisions for tuples

2018-10-04 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > Taking an algorithm in wide use that's already known to get a top score on > SMHasher and fiddling it to make a "slight" improvement in one tiny Python > test doesn't make sense to me. OK, I won't do that. The difference is not that much anyway (it

[issue34751] Hash collisions for tuples

2018-10-04 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > In the 64-bit build there are no collisions across my tests except for 11 in > the new tuple test. That's pretty bad actually. With 64 bits, you statistically expect something in the order of 10**-8 collisions. So what you're seeing is 9 orders of

[issue34751] Hash collisions for tuples

2018-10-04 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > Taking an algorithm in wide use that's already known to get a top score on > SMHasher and fiddling it to make a "slight" improvement in one tiny Python > test doesn't make sense to me. What I'm doing is the most innocent change: just applying a fixed

[issue34751] Hash collisions for tuples

2018-10-04 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > people already wrote substantial test suites dedicated to that sole purpose, > and we should aim to be "mere consumers" of functions that pass _those_ tests. There are hash functions that pass those tests which are still bad in practice when used as tuple

[issue34751] Hash collisions for tuples

2018-10-04 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > Note: I'm assuming that by "PRIME32_2" you mean 2246822519U Yes indeed. > and that "MULTIPLIER" means 2654435761U. No, I mean a randomly chosen multiplier which is 3 mod 8. -- ___ Python tracker

[issue34751] Hash collisions for tuples

2018-10-04 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > I've posted several SeaHash cores that suffer no collisions at all in any of > our tests (including across every "bad example" in these 100+ messages), > except for "the new" tuple test. Which it also passed, most recently with 7 > collisions. That was

[issue34751] Hash collisions for tuples

2018-10-03 Thread Tim Peters
Tim Peters added the comment: Here's a complete xxHash-based implementation via throwing away C++-isms from https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp In the 64-bit build there are no collisions across my tests except for 11 in the new tuple test. The 32-bit build

[issue34751] Hash collisions for tuples

2018-10-03 Thread Tim Peters
Tim Peters added the comment: Thanks for looking at xxHash! An advantage is that it already comes in 32- and 64-bit versions. > A (simplified and slightly modified version of) xxHash > seems to work very well, much better than SeaHash. ? I've posted several SeaHash cores that suffer no

[issue34751] Hash collisions for tuples

2018-10-03 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: A (simplified and slightly modified version of) xxHash seems to work very well, much better than SeaHash. Just like SeaHash, xxHash also works in parallel. But I'm not doing that and just using this for the loop: for y in t: y ^= y * (PRIME32_2

[issue34751] Hash collisions for tuples

2018-10-03 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: I'm having a look at xxHash, the second-fastest hash mentioned on https://docs.rs/seahash/3.0.5/seahash/ -- ___ Python tracker ___

[issue34751] Hash collisions for tuples

2018-10-03 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > I know of no such hash functions short of crypto-strength ones. Being crypto-strength and having few collisions statistically are different properties. For non-crypto hash functions it's typically very easy to generate collisions once you know the

[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters
Tim Peters added the comment: > SeaHash seems to be designed for 64 bits. Absolutely. > I'm guessing that replacing the shifts by > > x ^= ((x >> 16) >> (x >> 29)) > > would be what you'd do for a 32-bit hash. My guess too. But "the prime" is a puzzle. As noted before, the 64-bit prime

[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters
Tim Peters added the comment: FYI, this appears to be a good overview of what SMHasher is looking for: https://github.com/aappleby/smhasher/wiki/SMHasher Someg of everything: performance, correctness, statistical measures, and specific kinds of keysets that have proved problematic for other

[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters
Tim Peters added the comment: > So it seems that this SMHasher test suite doesn't > catch the problem that we're seeing with negative integers. Seems to be so, but I've never run SMHasher myself. I believe it's focused on statistical properties, like avalanche and bit independence. >

[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters
Tim Peters added the comment: > >>> from itertools import product > >>> len(set(map(hash, product([0.5, 0.25], repeat=20 > 32 > Good catch! Would you like me to add this to the testsuite? It's in mine already ;-) I've added all the "bad examples" in all the messages here. Sooner or

[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters
Tim Peters added the comment: >> For that reason, I've only been looking at those that >> scored 10 (best possible) on Appleby's SMHasher[1] test suite > Do you have a list of such hash functions? A world of searching. The ones rated "Excellent" here (previously cited) for a start:

[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters
Tim Peters added the comment: >> I've noted before, e.g., that sticking to a prime >> eliminates a world of regular bit patterns in the >> multiplier. > Why do you think this? 0x1fff is prime :-) Point taken ;-) But "a world of" is not the same as "the universe". For example,

[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > For that reason, I've only been looking at those that scored 10 (best > possible) on Appleby's SMHasher[1] test suite Do you have a list of such hash functions? -- ___ Python tracker

[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: >>> from itertools import product >>> len(set(map(hash, product([0.5, 0.25], repeat=20 32 Good catch! Would you like me to add this to the testsuite? -- ___ Python tracker

[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > For that reason, I've only been looking at those that scored 10 (best > possible) on Appleby's SMHasher[1] test suite, which is used by everyone who > does recognized work in this field. So it seems that this SMHasher test suite doesn't catch the problem

[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > I've noted before, e.g., that sticking to a prime eliminates a world of > regular bit patterns in the multiplier. Why do you think this? 0x1fff is prime :-) Having regular bit patterns and being prime are independent properties. To be clear:

[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters
Tim Peters added the comment: > If we, e.g., tested tuples of little floats instead ... Speaking of which: >>> from itertools import product >>> len(set(map(hash, product([0.5, 0.25], repeat=20 32 32 hash codes out of 1048576 distinct two-tuples isn't exactly confidence-inspiring

[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters
Tim Peters added the comment: > So my suggestion remains > > for y in INPUT: >t = hash(y) >t ^= t * SOME_LARGE_EVEN_NUMBER >h ^= t >h *= MULTIPLIER On the face of it, I'd be amazed if that passed SMHasher, because it only propagates bits "to the left". All hashes that score

[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters
Tim Peters added the comment: >> the author wants this transformation to be easily >> invertible, so a prime is necessary > A multiplication by any odd number modulo 2**64 is > invertible. Right! My mistake. > As I argued before, the concept of primes is > meaningless (except for the

[issue34751] Hash collisions for tuples

2018-10-02 Thread Tim Peters
Tim Peters added the comment: A meta-comment: a great deal of work has been done on non-crypto hashes in recent years. I'm not interested in rolling our own if one of the "winners" from that process can be adapted. For that reason, I've only been looking at those that scored 10 (best

[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > 100% pure SeaHash does x ^= t at the start first, instead of `t ^ (t << 1)` > on the RHS. Indeed. Some initial testing shows that this kind of "input mangling" (applying such a permutation on the inputs) actually plays a much more important role to avoid

[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: Correction: the FNV variant of SeaHash only fails the new testsuite, not the old one. The DJB variant of SeaHash fails both. -- ___ Python tracker

[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: SeaHash seems to be designed for 64 bits. I'm guessing that replacing the shifts by x ^= ((x >> 16) >> (x >> 29)) would be what you'd do for a 32-bit hash. Alternatively, we could always compute the hash with 64 bits (using uint64_t) and then truncate at

[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: This weekend I realized something important which I didn't realize before: some hash functions which I assumed to be good (i.e. small chance of collisions between any given two tuples) turned out to often fail the tests. This is because you don't just want

[issue34751] Hash collisions for tuples

2018-10-02 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > the author wants this transformation to be easily invertible, so a prime is > necessary A multiplication by any odd number modulo 2**64 is invertible. As I argued before, the concept of primes is meaningless (except for the prime 2) when computing modulo

[issue34751] Hash collisions for tuples

2018-10-01 Thread Tim Peters
Tim Peters added the comment: >Py_uhash_t t = (Py_uhash_t)y; >x ^= t ^ (t << 1); >x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL; >x ^= (x >> 32) >> (x >> 60); >x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL; Comment out either one of the multiplies, and it still passes all the tests.

[issue34751] Hash collisions for tuples

2018-10-01 Thread Tim Peters
Tim Peters added the comment: If SeaHash is interesting to us (it is to me), I suggest dropping our DJB/FNV combining part entirely and using a purer form, like so: Py_uhash_t t = (Py_uhash_t)y; x ^= t ^ (t << 1); x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL; x ^= (x >> 32) >> (x

[issue34751] Hash collisions for tuples

2018-10-01 Thread Tim Peters
Tim Peters added the comment: Noting that the failure of high-order bits to propagate is one form of "avalanche" failure. A recent 64-bit hash, SeaHash, takes this approach which is said to have provably "perfect" avalanche behavior: Py_uhash_t t = (Py_uhash_t)y; t *=

[issue34751] Hash collisions for tuples

2018-09-30 Thread Tim Peters
Tim Peters added the comment: An "aha!" moment for me: I noted before that replacing the tuple hash loop with Py_uhash_t t = (Py_uhash_t)y; t ^= t << 1; x = (x * mult) + t; does OK (64-bit build): most tests had no collisions, but the original tuple test had 24 (a modest

[issue34751] Hash collisions for tuples

2018-09-29 Thread Tim Peters
Tim Peters added the comment: Jeroen, thanks for helping us fly slightly less blind! ;-) It's a lot of work. I'd say you may as well pick a prime. It's folklore, but "a reason" is that you've discovered that regular bit patterns in multipliers can hurt, and sticking to primes eliminates

[issue34751] Hash collisions for tuples

2018-09-28 Thread Tim Peters
Tim Peters added the comment: [Tim] > Perhaps worth noting that FNV-1a works great if used as > _intended_: on a stream of unsigned bytes. > ... > >Py_uhash_t t = (Py_uhash_t)y; >for (int i = 0; i < sizeof(t); ++i) { >x = (x ^ (t & 0xff)) * (Py_uhash_t)1099511628211ULL; >

[issue34751] Hash collisions for tuples

2018-09-28 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: I spent about 2 days doing an extensive study of the FNV and DJB algorithms. I share my conclusions below. To be very clear what I mean, I am talking about the following algorithms (t is a tuple and m is the multiplier which is always assumed to be odd).

[issue34751] Hash collisions for tuples

2018-09-28 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > Replacing DJBX33A's multiplier of 33 is also a different algorithm. So is > working with inputs other than unsigned bytes. I would argue that this is just extending the parameters of the algorithm. If the algorithm is general enough, then that shouldn't

[issue34751] Hash collisions for tuples

2018-09-27 Thread Tim Peters
Tim Peters added the comment: Also worth noting: other projects need to combine hashes too. Here's a 64-bit version of the highly regarded C++ Boost[1] library's hash_combine() function (I replaced its 32-bit int literal with "a random" 64-bit one): x ^= (Py_uhash_t)y +

[issue34751] Hash collisions for tuples

2018-09-27 Thread Tim Peters
Tim Peters added the comment: Perhaps worth noting that FNV-1a works great if used as _intended_: on a stream of unsigned bytes. All tests except the new tuple hash test suffer no collisions; the new test suffers 14. Nothing is needed to try to worm around nested tuple catastrophes, or

[issue34751] Hash collisions for tuples

2018-09-27 Thread Tim Peters
Tim Peters added the comment: I should have spelled this out before: these are all permutations, so in general permuting the result space of `x * mult + y` (or any other permutation involving x and y) is exactly the same as not permuting it but applying a different permutation to y

[issue34751] Hash collisions for tuples

2018-09-26 Thread Tim Peters
Tim Peters added the comment: >> The two-liner above with the xor in the second line is >> exactly Bernstein 33A, followed by a permutation >> of 33A's _output_ space. > Not output space, but internal state ? 33A's output _is_ its internal state at the end. This is a distinction that

[issue34751] Hash collisions for tuples

2018-09-26 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > The two-liner above with the xor in the second line is exactly Bernstein 33A, > followed by a permutation of 33A's _output_ space. Not output space, but internal state (I assume that you do that operation inside the loop). It's replacing DJBX33A by a

[issue34751] Hash collisions for tuples

2018-09-26 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > please restore the original tuple hash test. Sure. I wasn't sure what to do and was I afraid that having 2 tests for tuple hashes would be too much. If that's OK for you, then surely I will restore the test. --

[issue34751] Hash collisions for tuples

2018-09-26 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > Do you believe any other multiplier would work better toward that end? Absolutely. Ideally, the multiplier should just be a random 64-bit number. -- ___ Python tracker

[issue34751] Hash collisions for tuples

2018-09-26 Thread Tim Peters
Tim Peters added the comment: High-order bit: please restore the original tuple hash test. You have the worst case of "but I didn't write it" I've ever encountered ;-) Your new test is valuable, but I've seen several cases now where it fails to detect any problems where the original test

[issue34751] Hash collisions for tuples

2018-09-26 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > There are _two_ hash functions at play in that collision: the tuple hash > function, and the integer hash function. Regardless of whether the _tuple_ > hash function does [anything involving just `t`], that only directly affects > the result of the

[issue34751] Hash collisions for tuples

2018-09-25 Thread Tim Peters
Tim Peters added the comment: >> j is even implies (j ^ -3) == -(j ^ 3) > This follows from what I posted before: if j is even, then > j ^ 3 is odd, so we can apply the rule x ^ -2 = -x to x = j ^ 3 > ... Thanks! That helps a lot. I had a blind spot there. This kind of thing would have

[issue34751] Hash collisions for tuples

2018-09-25 Thread Tim Peters
Tim Peters added the comment: > Suppose that there is a hash collision, say hash((3, 3)) == > hash((-3, -3)) and you change the hashing algorithm to fix > this collision. There are _two_ hash functions at play in that collision: the tuple hash function, and the integer hash function.

[issue34751] Hash collisions for tuples

2018-09-25 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: Regarding t ^= t << 7: I tested PR 9471 with all shift values from 1 to 20. The new testsuite passed for all shifts from 1 to 13, except for 6. It failed for 6 and for 14 to 20. This indicates that smaller shift values are better (even when looking at

[issue34751] Hash collisions for tuples

2018-09-25 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > I want to leave low-order hash bits alone. That's deliberate. Since I didn't understand the rationale for this and since shifting << 1 also seems to work well, I edited PR 9471 to use DJBX33A with t ^= t << 1. Since you insisted on adding 97531 at the

[issue34751] Hash collisions for tuples

2018-09-25 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > And one more: x = (x * mult) ^ t; also appears to work equally well. The order of operations does not really matter: you can write the loop as x *= mult # Appears only in FNV-1 x ^= t[0] x *= mult x ^= t[1] x *= mult x ^= t[2] x *= mult x ^=

[issue34751] Hash collisions for tuples

2018-09-25 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > j is even implies (j ^ -3) == -(j ^ 3) This follows from what I posted before: if j is even, then j ^ 3 is odd, so we can apply the rule x ^ -2 = -x to x = j ^ 3: (j ^ 3) ^ -2 = -(j ^ 3) which implies j ^ (3 ^ -2) = -(j ^ 3) or equivalently j ^ -3 =

[issue34751] Hash collisions for tuples

2018-09-25 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > When testing what, specifically? And the standard 32-bit FNV multiplier, or > the standard 64-bit FNV multiplier? FNV-1a with the t ^= 2 * t mangling running my new testsuite on either PR 9471 or PR 9534 using the 64-bit FNV multiplier to produce 64-bit

[issue34751] Hash collisions for tuples

2018-09-25 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > The low bits are already un-improvable in the most important cases. Maybe you misunderstood me. Suppose that there is a hash collision, say hash((3, 3)) == hash((-3, -3)) and you change the hashing algorithm to fix this collision. If that change does NOT

[issue34751] Hash collisions for tuples

2018-09-24 Thread Tim Peters
Tim Peters added the comment: And one more: x = (x * mult) ^ t; also appears to work equally well. So, way back when, it appears we _could_ have wormed around the disaster du jour just by applying a shift-xor permutation to the raw hash results. Note the implication: if we

[issue34751] Hash collisions for tuples

2018-09-24 Thread Tim Peters
Tim Peters added the comment: Just noting that this Bernstein-like variant appears to work as well as the FNV-1a version in all the goofy ;-) endcase tests I've accumulated: while (--len >= 0) { y = PyObject_Hash(*p++); if (y == -1) return -1;

[issue34751] Hash collisions for tuples

2018-09-24 Thread Tim Peters
Tim Peters added the comment: > advantage of my approach is that high-order bits become more > important: I don't much care about high-order bits, beyond that we don't systematically _lose_ them. The dict and set lookup routines have their own strategies for incorporating high-order bits

[issue34751] Hash collisions for tuples

2018-09-24 Thread Tim Peters
Tim Peters added the comment: Jeroen, I understood the part about -2 from your initial report ;-) That's why the last code I posted didn't use -2 at all (neither -1, which hashes to -2). None of the very many colliding tuples contained -2 in any form. For example, these 8 tuples all have

[issue34751] Hash collisions for tuples

2018-09-24 Thread Tim Peters
Tim Peters added the comment: > when you do t ^= t << 7, then you are not changing > the lower 7 bits at all. I want to leave low-order hash bits alone. That's deliberate. The most important tuple component types, for tuples that are hashable, are strings and contiguous ranges of "not

[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: I created a new PR based on Tim's t ^= t << 7 idea, except that I'm using << 1 instead, to have more mixing in the lower bits. With the standard FNV multiplier on 64 bits, I did get collisions while testing. I haven't figured out exactly why these occurred,

[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer
Change by Jeroen Demeyer : -- pull_requests: +8937 ___ Python tracker ___ ___ Python-bugs-list mailing list Unsubscribe:

[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > In the absence of a real analysis, the intuition is simply that "t ^= t << 7" > will clear masses of leading sign bits when hashing "small" negative integers. That's a clever solution. If you want to go that route, I would rather suggest t ^= t << 1 which

[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > For example, FNV-1a has far better "avalanche" behavior than Bernstein We don't care about that though. We want to have no collisions, not a random output. -- ___ Python tracker

[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > stuff like "t += t >> 16" is a many-to-one function, not a permutation Yes, I am aware of that. However, the number of collisions here is really quite small. It's very unlikely to hit one by accident. I also chose >> over << for two reasons: 1. It brings

[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: While writing up the analysis above, it occurred to me that collisions already happen for 2-tuples: >>> hash((3, -2)) == hash((-3, 0)) True These kind of 2-tuples of small integers don't look contrived at all. I can easily see them appearing, in

[issue34751] Hash collisions for tuples

2018-09-24 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > Has anyone figured out the real source of the degeneration when mixing in > negative integers? The underlying reason for the collisions is the following mathematical relation: x ^ -x = -2 << i where i is the index of the smallest set bit of x. In

[issue34751] Hash collisions for tuples

2018-09-23 Thread Tim Peters
Tim Peters added the comment: BTW, those tests were all done under a 64-bit build. Some differences in a 32-bit build: 1. The test_tuple hash test started with 6 collisions. With the change, it went down to 4. Also changing to the FNV-1a 32-bit multiplier boosted it to 8. The test

[issue34751] Hash collisions for tuples

2018-09-23 Thread Tim Peters
Tim Peters added the comment: FYI, using this for the guts of the tuple hash works well on everything we've discussed. In particular, no collisions in the current test_tuple hash test, and none either in the cases mixing negative and positive little ints. This all remains so using the

[issue34751] Hash collisions for tuples

2018-09-23 Thread Tim Peters
Tim Peters added the comment: [Raymond, on boosting the multiplier on 64-bit boxes] > Yes, that would be perfectly reasonable (though to some > extent the objects in the tuple also share some of the > responsibility for getting all bits into play). It's of value independent of that. Tuples

[issue34751] Hash collisions for tuples

2018-09-23 Thread Tim Peters
Tim Peters added the comment: Has anyone figured out the real source of the degeneration when mixing in negative integers? I have not. XOR always permutes the hash range - it's one-to-one. No possible outputs are lost, and XOR with a negative int isn't "obviously degenerate" in general:

[issue34751] Hash collisions for tuples

2018-09-23 Thread Raymond Hettinger
Raymond Hettinger added the comment: > To my eyes, the _strongest_ case in all these messages is > for boosting the multiplier size on 64-bit boxes. That's > principled and well-motivated. Yes, that would be perfectly reasonable (though to some extent the objects in the tuple also share

[issue34751] Hash collisions for tuples

2018-09-23 Thread Tim Peters
Tim Peters added the comment: Oh, I don't agree that it's "broken" either. There's still no real-world test case here demonstrating catastrophic behavior, neither even a contrived test case demonstrating that, nor a coherent characterization of what "the problem" is. I'm nevertheless open

[issue34751] Hash collisions for tuples

2018-09-23 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > We shouldn't feel shoved into altering something that we don't agree is broken It *is* broken. You are just denying the reality. That's also the reason that I'm insisting so much: I don't want to push my personal fix (despite that it may seem so), I want

[issue34751] Hash collisions for tuples

2018-09-22 Thread Tim Peters
Tim Peters added the comment: Raymond, I share your concerns. There's no reason at all to make gratuitous changes (like dropping the "post-addition of a constant and incorporating length signature"), apart from that there's no apparent reason for them existing to begin with ;-) Unintended

[issue34751] Hash collisions for tuples

2018-09-22 Thread Tim Peters
Tim Peters added the comment: I strive not to believe anything in the absence of evidence ;-) FNV-1a supplanted Bernstein's scheme in many projects because it works better. Indeed, Python itself used FNV for string hashing before the security wonks got exercised over collision attacks. It

[issue34751] Hash collisions for tuples

2018-09-22 Thread Raymond Hettinger
Raymond Hettinger added the comment: I don't think any change should be made unless we agree that there is a real problem to be solved rather than because the OP is brazenly insistent on forcing through some change. We shouldn't feel shoved into altering something that we don't agree is

[issue34751] Hash collisions for tuples

2018-09-22 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > the made-up hacks Python used to worm around a class of gross flaws its prior > DJBX33X approach suffered, taking DJBX33X out of its original context and > applying it in an area it wasn't designed for. But we know why the DJBX33A hash didn't work (nested

[issue34751] Hash collisions for tuples

2018-09-22 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: Thanks for the reference, I never heard of the FNV hash. Unfortunately, I haven't seen a reference for the rationale of how they pick their multiplier. It's not clear what you are suggesting though. Keep using the FNV-ish hash somehow? Or using a DJBX33A

[issue34751] Hash collisions for tuples

2018-09-22 Thread Tim Peters
Tim Peters added the comment: So you don't know of any directly relevant research either. "Offhand I can't see anything wrong" is better than nothing, but very far from "and we know it will be OK because [see references 1 and 2]". That Bernstein's DJBX33A has been widely used inspires no

[issue34751] Hash collisions for tuples

2018-09-22 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > I'm not aware of any research papers about picking multipliers in this > context, but would love to see one. The only real condition that I can think of is that the order should be large: we do not want MULTIPLIER**n = 1 (mod 2**N) for a small number n.

[issue34751] Hash collisions for tuples

2018-09-22 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > Which was your original suggestion. Which you appear to be opposed to now? > I'm unclear about why, if so. I'm not strictly opposed to that. It's just that I have less confidence in the current ad-hoc hash compared to something following the DJBX33A

[issue34751] Hash collisions for tuples

2018-09-22 Thread Jeroen Demeyer
Jeroen Demeyer added the comment: > I don't know that primes are important here, but neither do I know that > they're _not_ important here. Hashes are effectively computed modulo 2**N. "Primes" are meaningless in that setting (except for the prime 2 which does have a meaning). For example,

  1   2   >