Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-12 Thread Nicko van Someren
On 10 Sep 2007, at 01:58, Jim Jewett wrote: > To spell this out a bit more: > ... > When adding four entries to an 8-slot table, a truly random hash would > have at least one collision (0/8 + 1/8 + 2/8 + 3/8 =) 3/4 of the > time. As expected, the proposed hash does have a collision for those > fo

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-09 Thread Tim Peters
[Larry Hastings] > I see--it's avoiding the Birthday Paradox. It /tends/ to, yes. This wasn't a design goal of the string hash, it's just a property observed after it was adopted, and appreciated much later ;-) It's much clearer for Python's small-int hash, where hash(i) == i for i != -1. That

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-09 Thread Jim Jewett
On 9/8/07, Tim Peters <[EMAIL PROTECTED]> wrote: > in comments in dictobject.c. As it notes there, hashing the strings > "namea", "nameb", "namec", and "named" currently produces (on a > sizeof(long) == 4 box): > -1658398457 > -1658398460 > -1658398459 > -1658398462 > That the hash codes are ve

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-09 Thread Larry Hastings
Thomas Wouters wrote: Because (relatively) small dicts with (broadly speaking) similar keys are quite common in Python. Module and class and instance __dict__s, for instance ;) As Tim mentioned, the dict implementation only looks at part of the actual hash value (depending on the size of the di

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-09 Thread Thomas Wouters
On 9/9/07, Larry Hastings <[EMAIL PROTECTED]> wrote: > One goal of Jenkin's hashes is uniform distribution, so these functions > presumably lack the serendipitous "similar inputs hash to similar values" > behavior of Python's current hash function. But why is that a feature? > (Not that I doubt T

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-08 Thread Larry Hastings
If the Python community is just noticing the Hsieh hash, that implies that the Bob Jenkins hashes are probably unknown as well. Behold: http://burtleburtle.net/bob/hash/doobs.html To save you a little head-scratching, the functions you want to play with are hashlittle()/hashlittle2() in "l

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-08 Thread Tim Peters
[Guido] > I'd like Tim Peters's input on this before we change it. I seem to > recall that there's an aspect of non-randomness to the existing hash > function that's important when you hash many closely related strings, > e.g. "0001", "0002", "0003", etc., into a dictionary. Though it's been > so l

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-07 Thread Guido van Rossum
I'd like Tim Peters's input on this before we change it. I seem to recall that there's an aspect of non-randomness to the existing hash function that's important when you hash many closely related strings, e.g. "0001", "0002", "0003", etc., into a dictionary. Though it's been so long that I may mis

Re: [Python-3000] Performance Notes - new hash algorithm

2007-09-07 Thread Gregory P. Smith
On 9/4/07, Thomas Hunger <[EMAIL PROTECTED]> wrote: > > > Hello, > > I don't know much about python internals, so the following might be > bogus: > > I replaced unicode_hash and string_hash with the hash function from > here: http://www.azillionmonkeys.com/qed/hash.html. > > Then I ran the followin

Re: [Python-3000] Performance Notes

2007-09-07 Thread Nicholas Bastin
On 9/3/07, Nicholas Bastin <[EMAIL PROTECTED]> wrote: > NOTE: This data is time sampling, not call graph. Added time could > come from either more calls, or longer calls. > > +312.9% PyDict_GetItem I've finally managed to get call graph data and it's fairly interesting for this call. I try to f

Re: [Python-3000] Performance Notes

2007-09-04 Thread Thomas Hunger
> I've been doing some profiling of 3.0 vs. 2.6 release builds on > Windows XP for the purpose of hopefully closing the performance > gap. This data is very preliminary, but I thought I'd throw it out > here in case someone else also wanted to look into this. Also, > possibly useful for comparing

Re: [Python-3000] Performance Notes

2007-09-03 Thread Adam Olsen
On 9/3/07, Nicholas Bastin <[EMAIL PROTECTED]> wrote: > I've been doing some profiling of 3.0 vs. 2.6 release builds on > Windows XP for the purpose of hopefully closing the performance gap. > This data is very preliminary, but I thought I'd throw it out here in > case someone else also wanted to l

Re: [Python-3000] Performance Notes

2007-09-03 Thread Guido van Rossum
Interesting! Thanks for doing this. We'll need a lot of this over the coming year. I read in this that the increased cost is largely due to using unicode strings for all variable and attribute names. So the next step might be to optimize the snot out of unicode hashing and introduce the unicode eq

[Python-3000] Performance Notes

2007-09-03 Thread Nicholas Bastin
I've been doing some profiling of 3.0 vs. 2.6 release builds on Windows XP for the purpose of hopefully closing the performance gap. This data is very preliminary, but I thought I'd throw it out here in case someone else also wanted to look into this. Also, possibly useful for comparing against pr