[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-13 Thread Raymond Hettinger
Raymond Hettinger added the comment: Sweet! ___ Python tracker ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailm

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-13 Thread Antoine Pitrou
Changes by Antoine Pitrou : -- resolution: accepted -> fixed status: open -> closed ___ Python tracker ___ ___ Python-bugs-list mailing

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-13 Thread Antoine Pitrou
Antoine Pitrou added the comment: pointer_hash5_rotate.patch has been committed to trunk and py3k. Thanks! ___ Python tracker ___ ___ Python-bu

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-13 Thread Antoine Pitrou
Changes by Antoine Pitrou : -- assignee: -> pitrou ___ Python tracker ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-13 Thread Mark Dickinson
Mark Dickinson added the comment: Antoine, please check in a patch of your choice. I think we've beaten this issue to death already. :-) ___ Python tracker ___ __

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-13 Thread Antoine Pitrou
Antoine Pitrou added the comment: Ok, so the rotate version is really significantly faster (and, as Adam pointed out, it's also theoretically better). ___ Python tracker ___ _

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-13 Thread Mark Dickinson
Changes by Mark Dickinson : Removed file: http://bugs.python.org/file13054/pointer_hash5.patch ___ Python tracker ___ ___ Python-bugs-list mail

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-13 Thread Mark Dickinson
Changes by Mark Dickinson : Added file: http://bugs.python.org/file13070/pointer_hash5_rotate.patch ___ Python tracker ___ ___ Python-bugs-list

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-13 Thread Mark Dickinson
Changes by Mark Dickinson : Added file: http://bugs.python.org/file13069/pointer_hash5_xor.patch ___ Python tracker ___ ___ Python-bugs-list ma

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-13 Thread Mark Dickinson
Mark Dickinson added the comment: > Apart from the misindentation Apologies. My fault for editing Python files while at work, with a substandard editor configuration... > have you run the benchmark script with it? I have now. See attached file for 3 sets of results (original, xor version,

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-13 Thread Mark Dickinson
Changes by Mark Dickinson : Removed file: http://bugs.python.org/file12978/time_object_hash.py ___ Python tracker ___ ___ Python-bugs-list mail

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-12 Thread Adam Olsen
Adam Olsen added the comment: Antoine, x ^= x>>4 has a higher collision rate than just a rotate. However, it's still lower than a statistically random hash. If you modify the benchmark to randomly discard 90% of its contents this should give you random addresses, reflecting a long-running prog

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-12 Thread Antoine Pitrou
Antoine Pitrou added the comment: > Okay, how about this one? Apart from the misindentation (the file should use tabs not spaces), have you run the benchmark script with it? ___ Python tracker

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-12 Thread Mark Dickinson
Mark Dickinson added the comment: > Though I prefer you switch to x ^= x >> 4. Okay, how about this one? Short and sweet. No loss of information except when sizeof(void *) > sizeof(long) (unavoidable until someone finds a way to fit all 64-bit pointers into a 32-bit integer type...) Added fi

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-12 Thread Raymond Hettinger
Raymond Hettinger added the comment: Consider it approved. Though I prefer you switch to x ^= x >> 4. -- resolution: -> accepted ___ Python tracker ___

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-12 Thread Mark Dickinson
Mark Dickinson added the comment: +1 for checking in pointer_hash4.patch, provided Raymond approves. ___ Python tracker ___ ___ Python-bugs-lis

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-12 Thread Antoine Pitrou
Antoine Pitrou added the comment: > Other than that, the patch looks fine to me; x ^= x >> 4 would be fine > too. I've just tried x ^= x >> 4 and the speedup is smaller on our microbenchmark (time_object_hash.py). I conjecture that trying to maintain the sequentiality of adresses may have be

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-12 Thread Mark Dickinson
Mark Dickinson added the comment: > Well, I had memories of a weird signed/unsigned problem (issue4935) and > I wasn't sure whether it could raise its head in the present case or > not. I'm 99.9% sure that it's not a problem here. If it is a problem then it needs to be fixed in long_hash in lo

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-12 Thread Antoine Pitrou
Antoine Pitrou added the comment: Mark: > Just out of interest, why? The cast is unnecessary: there's no ambiguity > or undefinedness (the int -1 gets promoted to unsigned long, with > wraparound semantics), and neither gcc nor MSVC complains. Well, I had memories of a weird signed/unsigned

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-12 Thread Mark Dickinson
Mark Dickinson added the comment: > > - avoids comparing an unsigned long to -1 Just out of interest, why? The cast is unnecessary: there's no ambiguity or undefinedness (the int -1 gets promoted to unsigned long, with wraparound semantics), and neither gcc nor MSVC complains. Other than t

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Raymond Hettinger
Raymond Hettinger added the comment: David, yes, I did mean x ^= x>>4; How embarrassing. ___ Python tracker ___ ___ Python-bugs-list mailing li

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Adam Olsen
Adam Olsen added the comment: Testing with a large set of ids is a good demonstration, but not proof. Forming a set of *all* possible values within a certain range is proof. However, XOR does work (OR definitely does not) — it's a 1-to-1 transformation (reversible as you say.) Additionally, i

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread David W. Lambert
David W. Lambert added the comment: "x |= x>>4" Are you (Ray) sure you didn't mean "x ^= x>>4"? -- nosy: +LambertDW ___ Python tracker ___ _

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Raymond Hettinger
Raymond Hettinger added the comment: [Antoine] > Ok, updated patch: > - uses a 4-bit rotate (not shift) > - avoids comparing an unsigned long to -1 > - tries to streamline the win64 special path (but I can't test) pointer_hash4.patch looks fine to me. Still, I think it's worth considering the

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Adam Olsen
Adam Olsen added the comment: > At four bits, you may be throwing away information and I don't think > that's cool. Even if some selected timings are better with more bits > shifted, all you're really showing is that there is more randomness in > the upper bits than the lower ones. But that do

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Antoine Pitrou
Antoine Pitrou added the comment: Ok, updated patch: - uses a 4-bit rotate (not shift) - avoids comparing an unsigned long to -1 - tries to streamline the win64 special path (but I can't test) Added file: http://bugs.python.org/file13040/pointer_hash4.patch

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Raymond Hettinger
Raymond Hettinger added the comment: Guys, let's be careful. Make sure that efforts to randomize lower bits don't destroy information. Something like x |= x>>8 is reversible and fast. Other fun looking transformations are not: >>> len(set((x >> 4) + (x & 15) for x in range(10**6)))

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Antoine Pitrou
Changes by Antoine Pitrou : Removed file: http://bugs.python.org/file12981/time_object_hash.py ___ Python tracker ___ ___ Python-bugs-list mail

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Antoine Pitrou
Antoine Pitrou added the comment: Here is an updated benchmark script, for both object() and an user-defined class, and adding dict lookup, set lookup and set difference. Set difference is massively faster: up to 60% faster. Added file: http://bugs.python.org/file13039/time_object_hash.py

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Antoine Pitrou
Antoine Pitrou added the comment: Here is an updated patch. It uses a 4-bit shift and an addition. We should avoid the use of logical or, because it makes the outputs non-uniformly distributed ('1' bits are more likely). Added file: http://bugs.python.org/file13038/pointer_hash3.patch

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Mark Dickinson
Mark Dickinson added the comment: > At four bits, you may be throwing away information and I don't think > that's cool. The current patch *does* do a rotation: it doesn't throw away any information. ___ Python tracker __

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Antoine Pitrou
Antoine Pitrou added the comment: > I'm *much* more comfortable with a byte-swap, rotation, or xoring-in > upper bits than with shifts that potentially destroy entropy. > Otherwise, your taxing apps that build giant sets/dicts and need all > distinguishing bits to avoid collision pile-ups. Wou

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Raymond Hettinger
Raymond Hettinger added the comment: At four bits, you may be throwing away information and I don't think that's cool. Even if some selected timings are better with more bits shifted, all you're really showing is that there is more randomness in the upper bits than the lower ones. But that doe

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Mark Dickinson
Mark Dickinson added the comment: I'm fine with rotating 4 bits instead of 3, especially if the timings look good on 32-bit as well as 64-bit. We should really benchmark dict lookups (and set membership tests) as well as dict creation. ___ Python tracker

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Adam Olsen
Adam Olsen added the comment: Antoine, I only meant list() and dict() to be an example of objects with a larger allocation pattern. We get a substantial benefit from the sequentially increasing memory addresses, and I wanted to make sure that benefit wasn't lost on larger allocations than objec

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Chema Cortés
Changes by Chema Cortés : -- nosy: +chemacortes ___ Python tracker ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.pyth

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Mark Dickinson
Mark Dickinson added the comment: Here's an updated patch, that errs on the conservative side: - rotate instead of shifting, as suggested by Raymond. This costs very little, and I admit to feeling uncomfortable about the possibility of just throwing bits away - explicit check for -1 - s

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Antoine Pitrou
Antoine Pitrou added the comment: Le mercredi 11 février 2009 à 03:31 +, Adam Olsen a écrit : > > .. although, if I replace object() with list() I get best results with a > shift of 6 bits. Replacing it with dict() is best with 8 bits. But list() and dict() don't use id() for hash. _

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Mark Dickinson
Mark Dickinson added the comment: [Adam Olsen] > The alignment requirements (long double) make it impossible to have > anything in those bits. Not necessarily, since not all pointers passed to _Py_HashPointer come from a PyObject_Malloc. _Py_HashPointer is used for function pointers as well.

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-11 Thread Adam Olsen
Adam Olsen added the comment: The alignment requirements (long double) make it impossible to have anything in those bits. Hypothetically, a custom allocator could lower the alignment requirements to sizeof(void *). However, rotating to the high bits is pointless as they're the least likely to

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-10 Thread Raymond Hettinger
Raymond Hettinger added the comment: Instead of a shift, how about a rotate or byteswap in case the lower bits ever become significant again in some build. ___ Python tracker ___

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-10 Thread Adam Olsen
Adam Olsen added the comment: Upon further inspection, although a shift of 4 (on a 64-bit linux box) isn't perfect for dict, it's fairly close to it and well beyond random hash values. Mixing things more is just gonna lower it towards random values. >>> c() 2: 1, 1, 1, 2, 2, 1,

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-10 Thread Adam Olsen
Adam Olsen added the comment: On my 64-bit linux box there's nothing in the last 4 bits: >>> [id(o)%16 for o in [object() for i in range(128)]] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-10 Thread Antoine Pitrou
Antoine Pitrou added the comment: Le mardi 10 février 2009 à 21:18 +, Antoine Pitrou a écrit : > Note: a 64-bit build shows an even greater allocation unit: > > >>> class C(object): > ...__slots__ = ('x') > ... > >>> l = [C() for i in range(20)] > >>> [id(l[i+1]) - id(l[i]) for i in ra

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-10 Thread Antoine Pitrou
Antoine Pitrou added the comment: Some tests on py3k (32-bit build): >>> l = [object() for i in range(20)] >>> [id(l[i+1]) - id(l[i]) for i in range(len(l)-1)] [16, -96, 104, 8, 8, 8, 8, 8, -749528, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8] >>> class C(object): ...__slots__ = () ... >>> l = [C() for i

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-10 Thread Mark Dickinson
Mark Dickinson added the comment: Some comments, while I remember: * the argument to _Py_HashPointer is not always divisible by 8. It's called to create hashes for various objects, including methodobjects; see the line: y = _Py_HashPointer((void*)(a->m_ml->ml_meth)); in meth_hash in

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-09 Thread Jesús Cea Avión
Changes by Jesús Cea Avión : -- nosy: +jcea ___ Python tracker ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.o

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-08 Thread Antoine Pitrou
Antoine Pitrou added the comment: I observe even greater speedups (15%/20%/37%) on set creation. Here is the updated benchmark script. Added file: http://bugs.python.org/file12981/time_object_hash.py ___ Python tracker ___

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-08 Thread Antoine Pitrou
Antoine Pitrou added the comment: Benchmark results on my machine (64-bit Linux, gcc 4.3.2, AMD X2 3600+): Before: dict creation (selected): 5.09600687027 dict creation (shuffled): 5.66548895836 dict creation: 3.72823190689 After: dict creation (selected): 4.57248306274 (10% speedup) dic

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-08 Thread Antoine Pitrou
Antoine Pitrou added the comment: The code path for SIZEOF_LONG < SIZEOF_VOID_P could probably also benefit from this optimization by casting the pointer to a size_t (this will effect 64-bit Windows, where long is 32 bits but pointers are 64 bits). (unfortunately it seems the 64-bit Windows bui

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-08 Thread Mark Dickinson
Mark Dickinson added the comment: Here are some timings for dict creation, created with the attached script. They're not particularly scientific, but they at least show that this one- line optimization can make a significant difference. Typical results on my machine (OS X 10.5.6/Intel), 32-bi

[issue5186] Reduce hash collisions for objects with no __hash__ method

2009-02-08 Thread Mark Dickinson
New submission from Mark Dickinson : In the issue 5169 discussion, Antoine Pitrou suggested that for an object x without a __hash__ method, id()/8 might be a better hash value than id(), since dicts use the low order bits of the hash as initial key, and the 3 lowest bits of an id() will always