Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-12 Thread Nicko van Someren
On 10 Sep 2007, at 01:58, Jim Jewett wrote: > To spell this out a bit more: > ... > When adding four entries to an 8-slot table, a truly random hash would > have at least one collision (0/8 + 1/8 + 2/8 + 3/8 =) 3/4 of the > time. As expected, the proposed hash does have a collision for those > fo

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-09 Thread Tim Peters
[Larry Hastings] > I see--it's avoiding the Birthday Paradox. It /tends/ to, yes. This wasn't a design goal of the string hash, it's just a property observed after it was adopted, and appreciated much later ;-) It's much clearer for Python's small-int hash, where hash(i) == i for i != -1. That

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-09 Thread Jim Jewett
On 9/8/07, Tim Peters <[EMAIL PROTECTED]> wrote: > in comments in dictobject.c. As it notes there, hashing the strings > "namea", "nameb", "namec", and "named" currently produces (on a > sizeof(long) == 4 box): > -1658398457 > -1658398460 > -1658398459 > -1658398462 > That the hash codes are ve

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-09 Thread Larry Hastings
Thomas Wouters wrote: Because (relatively) small dicts with (broadly speaking) similar keys are quite common in Python. Module and class and instance __dict__s, for instance ;) As Tim mentioned, the dict implementation only looks at part of the actual hash value (depending on the size of the di

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-09 Thread Thomas Wouters
On 9/9/07, Larry Hastings <[EMAIL PROTECTED]> wrote: > One goal of Jenkin's hashes is uniform distribution, so these functions > presumably lack the serendipitous "similar inputs hash to similar values" > behavior of Python's current hash function. But why is that a feature? > (Not that I doubt T

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-08 Thread Larry Hastings
If the Python community is just noticing the Hsieh hash, that implies that the Bob Jenkins hashes are probably unknown as well. Behold: http://burtleburtle.net/bob/hash/doobs.html To save you a little head-scratching, the functions you want to play with are hashlittle()/hashlittle2() in "l

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-08 Thread Tim Peters
[Guido] > I'd like Tim Peters's input on this before we change it. I seem to > recall that there's an aspect of non-randomness to the existing hash > function that's important when you hash many closely related strings, > e.g. "0001", "0002", "0003", etc., into a dictionary. Though it's been > so l

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-07 Thread Guido van Rossum
I'd like Tim Peters's input on this before we change it. I seem to recall that there's an aspect of non-randomness to the existing hash function that's important when you hash many closely related strings, e.g. "0001", "0002", "0003", etc., into a dictionary. Though it's been so long that I may mis

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-07 Thread Gregory P. Smith
On 9/4/07, Thomas Hunger <[EMAIL PROTECTED]> wrote: > > > Hello, > > I don't know much about python internals, so the following might be > bogus: > > I replaced unicode_hash and string_hash with the hash function from > here: http://www.azillionmonkeys.com/qed/hash.html. > > Then I ran the followin