[issue30671] dict: improve lookup function

2017-06-24 Thread Tim Peters

Tim Peters added the comment:

Actually, there is something to be gained here, for smaller tables.  The simple 
formulas for the expected number of probes under uniform hashing are upper 
bounds, and are significantly overstated when the load factor is very high (not 
a concern for Python) or the table is small.  Using exact analysis gives 
smaller values in those cases, which a slow implementation of uniform hashing 
achieves.  The current method does not.  I'll post more about this to 
python-ideas.
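
For concreteness, a small sketch of the gap (using the standard uniform-hashing
results, not numbers from this message): with m slots, n of them occupied, an
unsuccessful search under uniform hashing probes slots in a random order
without replacement, so the exact expected probe count is (m+1)/(m-n+1), while
the simple formula 1/(1 - n/m) is only an upper bound:

    def exact_unsuccessful(m, n):
        # Exact expectation: position of the first empty slot in a
        # random permutation of m slots, n of which are occupied.
        return (m + 1) / (m - n + 1)

    def upper_bound(m, n):
        # The simple geometric bound 1/(1 - load_factor).
        return 1 / (1 - n / m)

    for m in (8, 64, 2**20):
        n = (2 * m) // 3   # a dict-like 2/3 load factor
        print(m, exact_unsuccessful(m, n), upper_bound(m, n))

The two agree in the limit, but at m=8 the bound (about 2.67) noticeably
overstates the exact value (2.25) -- the small-table effect described above.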

[issue30671] dict: improve lookup function

2017-06-19 Thread Tim Peters

Tim Peters added the comment:

BTW, I should have spelled this out earlier:  recall that x and y collide on 
the first probe if and only if

x = y (mod 2**k)   [1]

The second probe never happens unless they do collide on the first probe, so 
when looking at the second probe we can assume that [1] is true.  Since [1] is 
true,

5*(x % 2**k) + 1 = 5*(y % 2**k) + 1

is also true (and there's nothing special here about "5" and "1" - it would 
hold for any multiplier and any addend).  The second probe just adds (x >> 5) % 
2**k to the left side of that and (y >> 5) % 2**k to the right side of that.  
Therefore the second probes collide if and only if (in addition to [1])

x >> 5 = y >> 5 (mod 2**k)  [2]

The primary point is that any analysis that ignores the higher-order bits is 
missing the primary thing that matters ;-)
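
To make [1] and [2] concrete, here is a minimal sketch of the probe recurrence
under discussion (the constants 5, 1, and PERTURB_SHIFT = 5 are as described
above; this is a model, not CPython's C code):

    PERTURB_SHIFT = 5

    def probe_sequence(h, k, n):
        # Yield the first n slot indices probed for hash code h
        # in a table with 2**k slots.
        mask = (1 << k) - 1
        i = h & mask           # first probe: h mod 2**k
        perturb = h
        for _ in range(n):
            yield i
            perturb >>= PERTURB_SHIFT
            i = (5 * i + perturb + 1) & mask

    # k=3: 33 and 289 satisfy both [1] and [2], so their first two probes
    # collide; 65 satisfies [1] but not [2], so only the first does.
    assert list(probe_sequence(33, 3, 2)) == [1, 7] == list(probe_sequence(289, 3, 2))
    assert list(probe_sequence(65, 3, 2)) == [1, 0]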

[2] does reveal some potential weaknesses in the scheme.  For example, if k < 
5, some bits of the hash code have no effect on the probe sequence (e.g., if 
k==1, the first probe depends only on the hash code's last bit, and if there's 
a collision then a second-probe collision depends only on bit 2**5 -- bits 2**1 
through 2**4 are effectively ignored -- but the table only has 2 slots in the 
k==1 case).  Decreasing PERTURB_SHIFT works against that.

OTOH, if k > 5, some of the bits that contributed to [1] being true also feed 
into whether [2] is true.  Increasing PERTURB_SHIFT works against that.

Setting PERTURB_SHIFT=k (i.e., making it depend on the number of slots) avoids 
both, but doesn't appear to actually perform better than always using 5.

The rub is that collisions just aren't a real problem under the current scheme 
- and hashsim.py shows it's doing about as well on that count as the 
theoretical gold standard ("uniform hashing").

If people have time to spend on dicts, I'd rather see them pursue entirely 
different methods that guarantee to slash the _worst_ case number of probes.  
hashsim.py also shows that while long collision chains are rare, they do occur 
(and they also occur under "uniform hashing").  Then, e.g., we could drop the 
annoying "hash randomization" cruft.  For example, dig into "cuckoo hashing" 
and "Robin Hood hashing".  Just be warned that, regardless of scheme, the 
devil's in the details :-(
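
To make "Robin Hood hashing" concrete, a toy sketch (illustrative only: linear
probing, no deletion or resizing): an inserted key steals the slot of any
resident entry that sits closer to its own home slot, which caps the spread --
and hence the worst case -- of probe lengths:

    def rh_insert(table, key, k):
        # Toy Robin Hood insertion into a table with 2**k slots;
        # entries are (key, distance_from_home) pairs or None.
        mask = (1 << k) - 1
        entry = (key, 0)
        i = hash(key) & mask
        while table[i] is not None:
            if table[i][1] < entry[1]:       # resident is "richer": rob it
                table[i], entry = entry, table[i]
            i = (i + 1) & mask
            entry = (entry[0], entry[1] + 1)
        table[i] = entry

The swap rule keeps every entry's probe distance close to its neighbors',
which is exactly what slashing the worst case buys.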

[issue30671] dict: improve lookup function

2017-06-18 Thread Tim Peters

Tim Peters added the comment:

Dmitry, I suggest you spend more of the time you give to thinking to writing 
code instead ;-)  But, really, that's the easiest & surest way to discover for 
yourself where your analysis is going off in the weeds.

For example, issue 28201 was both simpler and worse than your thoughts about 
it:  x and y collide initially if and only if x = y mod 2**k.  If they do 
collide, then _regardless_ of the value of N it's necessarily the case that N*x 
+ x + 1 = N*y + y + 1 mod 2**k too.  That is, the second probes _always_ 
collided (not just half the time) in the case of a first-probe collision, and 
this would be so regardless of whether we were multiplying by 5, 4, 1, 0, or 
31459265358.  That was an unfortunate "off by 1" kind of mistake in the way the 
code was written.  It had nothing to do with, e.g., zero divisors.
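
That algebra is easy to confirm by brute force (a throwaway sketch; the
multipliers are the ones named above):

    import random

    # If x = y (mod 2**k), then N*x + x + 1 = N*y + y + 1 (mod 2**k)
    # for *any* multiplier N, so the old second probe always collided.
    for _ in range(10_000):
        k = random.randrange(1, 20)
        m = 1 << k
        x = random.getrandbits(64)
        y = x + m * random.getrandbits(40)   # force x = y (mod 2**k)
        N = random.choice([5, 4, 1, 0, 31459265358])
        assert (N*x + x + 1) % m == (N*y + y + 1) % m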

After the obvious fix was applied, there's very little that can be said in 
reality.  Because, after the fix, higher-order bits of the hash codes - which 
had nothing at all to do with the initial x = y mod 2**k collision - are added 
in to the second probe values.  There's no good reason I can see to calculate 
what happens if those never-before-considered bits _happen_ to be all zeroes.  
They might be - but so what?  They might be any pair of values in the cross 
product of range(2**5) with itself.  There's nothing especially interesting 
about the (0, 0) pair.

That's why you're getting pushback:  your analysis hasn't made sense to me, and 
the things you're calculating still don't appear to have much of anything to do 
with how collisions are actually resolved.  To the contrary, so long as you're 
ignoring the higher-order hash code bits (about which we can infer _nothing_ 
from that the first probe collided), you're ignoring the _heart_ of the 
collision resolution strategy.

Some other things writing code would make immediately obvious:

- Hashes of strings have nothing to do with pointers or memory addresses.  They 
have solely to do with the characters the string contains, and all values in 
range(256) show up as the last byte of string hashes.

- While hashes of pointers do depend on addresses, the internal 
`_Py_HashPointer()` rotates addresses right by 4 bits so that the predictable 
trailing zero bits have no effect on the first probe.  For example,

>>> a = object()
>>> id(a)    # the address
1583819629360
>>> hex(_)
'0x170c301a330'
>>> hash(a)  # the address is rotated so `0x3` is the last nibble
98988726835
>>> hex(_)
'0x170c301a33'
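
The rotation itself is easy to model (a sketch assuming a 64-bit build; the
real `_Py_HashPointer()` also special-cases a result of -1, which is reserved
for errors):

    def rotate_right_4(addr, bits=64):
        # Rotate the address right by 4 bits, as described above.
        mask = (1 << bits) - 1
        return ((addr >> 4) | (addr << (bits - 4))) & mask

    assert rotate_right_4(0x170c301a330) == 0x170c301a33  # the example above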

Because of all that, I'm going to close this.  But if you have an idea that 
actually speeds things, please post a patch and reopen it!  While it's true 
that I don't expect to see an actual improvement, clocks usually beat arguments 
;-)

--
resolution:  -> rejected
stage:  -> resolved
status: open -> closed

[issue30671] dict: improve lookup function

2017-06-18 Thread Dmitry Rubanovich

Dmitry Rubanovich added the comment:

A few notes on my motivation and other comments I made (not necessarily in the 
order of significance):

1. I started looking at this because of the fix in Issue28201.  It effectively 
improves on the difference between 

print_collision_counts(previous_lookdict_perturb)

and 

print_collision_counts(py3_6_1_lookdict_perturb)

because the before-28201 code was, in effect, multiplying by 6 instead of 5 
and, therefore, always had a 1/2 chance of secondary hashes colliding.

2. What I demonstrated was that the fix, as it stands, didn't do as much magic 
as we would hope, because it produced a scheme in which **first secondary** 
hashes would collide at a rate as high as 1/3.

3. I somewhat regret extending the script to demonstrate what happens to 2nd, 
3rd secondary hashes because it created the perception that I was trying to 
demonstrate a simulation.  I wasn't.  I was trying to demonstrate a 
calculation.  Well, calculations for each dictionary size up to 2**20.  

4. I detect apathy towards the fix, but I get the sense that it's mostly due to 
the fact that I initially didn't realize that I should have squarely limited 
the issue to the 1st secondary hash.  And after I made the initial error in the 
calculation (as pointed out by Inada Naoki) which would have produced an 
unstable calculation of the 2nd, 3rd, 4th, etc. secondary hashes (as pointed 
out by Tim), there is a lot of push back.  

But the fix I propose DOES improve on ***1st secondary hash*** because it 
eliminates any possibility of collision.  That is to say, no 2 1st secondary 
hashes will be the same for any 2 different primary hashes.  

And in the latest iteration of this proposal I did limit the scope of the 
enhancement to only 1st secondary hash.  As it happens, this collision actually 
does not occur for dictionaries of sizes 8, 16, or 32.  They only begin to be 
possible for dictionaries of sizes 64 or greater.  The latest script 
demonstrates it.

5.  Sorry, Tim, but I should have reserved my statements to only the 
calculation I was certain about.  The further comments about what happens to 
later secondary hashes was a speculation on my part and this calculation was 
not revealing much about it.  If you want to consider what happens with later 
secondary indexes (as you do in hashsim.py), that's a different problem.  To 
reiterate, my intention was to further improve on the fix in 28201.


On the size of perturb:

6.  Since hashes of strings and other objects are pointers, and can safely be 
assumed to be pointers addressing the heap, they are aligned on the boundary 2 
* sizeof(void*).  That's 16 in 64-bit builds.  So the last 4 bits are 0.  And 
(depending on how much heap is used) the higher bits of the pointers are also 
similar in applications which don't use a lot of memory.  

For example, in an application in which 16 MiB of heap is used, only bits 
35 (= 63 - 4 - 24) through 59 (= 63 - 4) are truly random.  The rest of the bits can be 
assumed to not change, but they are not reached until 5th iteration of the 
loop, so this represents an already mildly-degenerate case in this use case (in 
which most dictionary keys are strings).

I haven't given much thought to which bits will have the most entropy in 
general, but that's going too far off topic.

--
type: enhancement -> performance
Added file: http://bugs.python.org/file46961/1st_secondary_collision_counts.py

[issue30671] dict: improve lookup function

2017-06-18 Thread Tim Peters

Tim Peters added the comment:

I don't see a reason to keep this open, but I haven't been able to follow the 
OP's line of argument.  My best _guess_ is that they're chasing illusions based 
on not understanding (or grossly undervaluing) that the primary point of the 
perturb logic is to incrementally fold in hash code bits that _didn't_ have any 
effect at all on the initial table index.  It's true that for a hash table with 
2**p slots, the current scheme could be improved in several ways _if_ hash 
codes only had p bits.  To judge from the Python code they posted, they either 
(erroneously) assumed that was the case, or offhandedly dismissed the potential 
value of folding in hash code bits beyond the low-order p.

But, since I've been unable to follow the argument, I could be wrong about all 
that.  So if they don't clarify within a few days, I'll close this.

[issue30671] dict: improve lookup function

2017-06-18 Thread Raymond Hettinger

Raymond Hettinger added the comment:

Can this be closed now?

[issue30671] dict: improve lookup function

2017-06-17 Thread Tim Peters

Tim Peters added the comment:

The attached hashsim.py pretty much convinces me there's nothing left to be 
gained here.  It shows that the current scheme essentially achieves the 
performance of theoretical "uniform hashing" for string keys, for both 
successful and unsuccessful searches, as measured by the average number of 
probes needed.  Uniform hashing is the idealized scheme in which, for each key, the 
probe sequence is chosen uniformly at random from the set of all table slot 
permutations.  Its advantage is that it's feasible to analyze - its 
disadvantage is that it can't actually be sanely implemented ;-)
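
It can, however, be simulated slowly, which is what makes it usable as a
benchmark (a toy model of the kind of measurement hashsim.py makes; the names
here are illustrative):

    import random

    def avg_probes_uniform(nslots, nkeys, trials=100, rng=random.Random(0)):
        # Average probes per successful search under uniform hashing:
        # each key probes the slots in a fresh random order.
        total = 0
        for _ in range(trials):
            table = [False] * nslots
            for _ in range(nkeys):
                for n, slot in enumerate(rng.sample(range(nslots), nslots), 1):
                    if not table[slot]:
                        table[slot] = True
                        total += n
                        break
        return total / (trials * nkeys)

    print(avg_probes_uniform(1024, 682))   # ~1.65 at a 2/3 load factor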

Back when the current scheme was implemented, string keys sometimes behaved 
"better than random", for reasons explained in the code comments.  The string 
hash now is more like a high-quality PRNG, so the performance of uniform 
hashing is the best that can be hoped for.  Dicts with int keys can still enjoy 
better-than-random performance.

Back then I was using a 32-bit box, and I'm pleased to see that PERTURB_SHIFT=5 
still appears to work well on a 64-bit box (note:  to a first approximation you 
can emulate a 32-bit box on a 64-bit box by setting M64 in the code to a 32-bit 
mask).  But the choice appears far less touchy on a 64-bit box, and it may help 
a little to increase PERTURB_SHIFT.  Not so much in the average case, but for 
pathological cases, like int keys all of which have a lot of trailing zeroes.  
Before the bits that differ get shifted down far enough to matter, they all 
follow the same probe sequence.  Increasing PERTURB_SHIFT would reduce the 
number of iterations needed before that happens.  But I'd wait until someone 
complains before churning the code just for that ;-)
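
To see the trailing-zeroes pathology concretely (a throwaway sketch of the
scheme above, not hashsim.py itself):

    PERTURB_SHIFT = 5

    def colliding_probes(h1, h2, k, shift=PERTURB_SHIFT):
        # How many consecutive probes do two hash codes share in a
        # table with 2**k slots before their sequences diverge?
        mask = (1 << k) - 1
        i1, i2, p1, p2, n = h1 & mask, h2 & mask, h1, h2, 0
        while i1 == i2:
            n += 1
            p1 >>= shift; p2 >>= shift
            i1 = (5*i1 + p1 + 1) & mask
            i2 = (5*i2 + p2 + 1) & mask
        return n

    print(colliding_probes(1 << 40, 1 << 41, 10))            # 7 shared probes
    print(colliding_probes(1 << 40, 1 << 41, 10, shift=10))  # 4 with a bigger shift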

--
Added file: http://bugs.python.org/file46957/hashsim.py

[issue30671] dict: improve lookup function

2017-06-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I totally concur with Tim.

Your script tests all hash codes in range(1 << …)

[issue30671] dict: improve lookup function

2017-06-16 Thread Tim Peters

Tim Peters added the comment:

Yes, any scheme whatsoever that guarantees to visit every int in range(2**i) 
meets the "correctness" part.

But I concur with Inada:  "head arguments" aren't compelling here, not even if 
I understood what they were saying ;-)  If you implement it and demonstrate 
significant speedups in plausibly realistic settings, then it may be worth 
pursuing.  Anything short of that is at best informed speculation, and so 
better hashed out on, e.g., the python-ideas mailing list than on the bug 
tracker.
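
That correctness property is easy to sanity-check by brute force (a quick
sketch; once perturb decays to zero, the recurrence reduces to a full-period
linear congruential step):

    # With perturb gone, the probe loop is i = (5*i + 1) % 2**k,
    # which cycles through every slot exactly once per period.
    for k in range(1, 13):
        m = 1 << k
        i, seen = 0, set()
        for _ in range(m):
            seen.add(i)
            i = (5 * i + 1) % m
        assert len(seen) == m, k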

BTW, I realize you're not running tests against random data.  But you're not 
running tests against _any_ real data, which is worrisome.  Your code appears 
to be utterly uniform, thus approximating a uniform random distribution.  I 
have scant idea of what you believe your code shows.  For example, it only 
looks at "hash codes" in `range(1<

[issue30671] dict: improve lookup function

2017-06-16 Thread Dmitry Rubanovich

Dmitry Rubanovich added the comment:

Changing the title, to remove "simplify", since the discussion revealed that 
the potential improvement would have to be treated as a special case and, 
therefore, would not simplify the code.

--
title: dict: simplify and improve lookup function -> dict: improve lookup function
