Determining number of dict key collisions in a dictionary

2008-12-02 Thread python
Is there any way to determine the number of dictionary key
collisions in a specific dictionary?
Background: I'm working on a project that uses very large dictionaries
(64-bit Python), and my client is asking how effective Python's default
hash technique is for our data set. Their concern stems from the belief
that Python's default dictionary hash scheme is optimized for 32-bit
rather than 64-bit environments and may not have anticipated the
additional range of keys that can be generated in a 64-bit environment.
Our keys are 20- to 44-byte ASCII (7-bit) alphanumeric strings.
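
The closest I've come to measuring this myself is simulating just the
initial probe from outside the dict: mask each hash(key) to a
power-of-two table size and count the keys that land on an occupied
slot. It ignores CPython's real perturbation-based probing, so treat it
as a rough sketch only:

    def estimate_slot_collisions(keys, fill=2.0 / 3.0):
        # Simulated table: smallest power of two that keeps the load
        # factor under ~2/3 (CPython keeps dicts at most 2/3 full).
        size = 8
        while len(keys) > fill * size:
            size *= 2
        used = set()
        collisions = 0
        for key in keys:
            slot = hash(key) & (size - 1)  # initial probe slot only
            if slot in used:
                collisions += 1
            else:
                used.add(slot)
        return collisions, size

    keys = ["key%06d" % i for i in range(100000)]
    print(estimate_slot_collisions(keys))
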
Thank you,
Malcolm


Re: Determining number of dict key collisions in a dictionary

2008-12-02 Thread Roger Binns

[EMAIL PROTECTED] wrote:
 Background: I'm working on a project that uses very large dictionaries
 (64-bit Python), and my client is asking how effective Python's default
 hash technique is for our data set.

Python hash functions return a C long, which in a 64-bit process is 32
bits on Windows and 64 bits on pretty much every other 64-bit
environment.
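
If you want to confirm what a particular build does, here is a quick
check (Python 2.x; sys.maxint reflects the C long and is gone in 3.0,
and the key string below is just a stand-in for one of yours):

    import sys
    # C "long" ceiling: 2**31 - 1 on 64-bit Windows (LLP64), but
    # 2**63 - 1 on LP64 platforms such as 64-bit Linux.
    print(sys.maxint)
    # Py_ssize_t ceiling: 2**63 - 1 on either 64-bit platform.
    print(sys.maxsize)
    print(hash("a 20-to-44 byte alphanumeric key"))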

 Their concern stems from the belief that Python's default dictionary
 hash scheme is optimized for 32-bit rather than 64-bit environments and
 may not have anticipated the additional range of keys that can be
 generated in a 64-bit environment. Our keys are 20- to 44-byte ASCII
 (7-bit) alphanumeric strings.

Why not have them look at the source code?  It is well commented and
there is another file with various notes.  Look at Objects/dictobject.c
and Objects/dictnotes.txt.

A teaser comment for you:

   Most hash schemes depend on having a good hash function, in
   the sense of simulating randomness.  Python doesn't.
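
For reference, the string hash itself is short enough to sketch in pure
Python. This is an unsigned approximation of the C version in
Objects/stringobject.c, which works on a signed long and also remaps a
result of -1 to -2 (hash() reserves -1 for errors):

    def string_hash(s, bits=64):
        # Sketch of CPython 2.x's string hash: one multiply and one
        # xor per character, seeded by the first character.
        mask = 2 ** bits - 1
        if not s:
            return 0
        x = (ord(s[0]) << 7) & mask
        for ch in s:
            x = ((1000003 * x) ^ ord(ch)) & mask
        x ^= len(s)
        return x

    print(string_hash("spam"))           # 64-bit flavour
    print(string_hash("spam", bits=32))  # what a 64-bit Windows build gets
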

Roger


Re: Determining number of dict key collisions in a dictionary

2008-12-02 Thread python
Roger,

Apologies for responding directly to your email vs. the list (just
realized this now).

Thank you very much for your detailed answers and the reminder about
premature optimization.

You have answered all my questions.

Regards,

Malcolm


- Original message -
From: Roger Binns [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Date: Tue, 02 Dec 2008 19:29:05 -0800
Subject: Re: Determining number of dict key collisions in a dictionary


[EMAIL PROTECTED] wrote:
 Any ideas on why the Windows 64-bit version of Python has 32-bit vs.
 64-bit longs? (Is this the case for both the 2.6 and 3.0 64-bit versions
 of Python for Windows?)

It is part of the ABI for the platform.

http://www.unix.org/version2/whatsnew/lp64_wp.html
http://en.wikipedia.org/wiki/64-bit#64-bit_data_models
http://blogs.msdn.com/oldnewthing/archive/2005/01/31/363790.aspx

 If this is the case, then it sounds like Linux/Unix are much better
 Python hosts than Windows when it comes to developing 64-bit Python
 applications - especially when it comes to working with large
 dictionaries (your point about hash collisions notwithstanding)?

Python uses a type called Py_ssize_t, which is typedef'ed to 64 bits on
a 64-bit platform and 32 bits on a 32-bit platform, for things like
sizes (e.g. the length of a string or the number of items in a
container), so there won't be any programmatic or correctness issues.

You never said what your idea of large was in the original posting.
For example, on LP64 you really only need to start worrying about hash
collisions once you have more than 4 billion items, and even then
things will still work.
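
As a back-of-the-envelope check (assuming, generously, uniformly
distributed hash values, which Python's hash only approximates), the
expected number of colliding pairs among n keys hashed into 2**bits
values is about n*(n-1)/2 / 2**bits:

    def expected_colliding_pairs(n, bits):
        # Birthday-problem estimate: n keys hashed uniformly into
        # 2**bits possible values.
        return n * (n - 1) / 2.0 / 2 ** bits

    for n in (10 ** 6, 10 ** 8, 4 * 10 ** 9):
        print((n, expected_colliding_pairs(n, 32),
               expected_colliding_pairs(n, 64)))

With 64-bit hashes, even 4 billion keys give you an expected total of
well under one colliding pair; with 32-bit hashes, collisions are
unavoidable long before that.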

For people who have really large data sets, it is usually beneficial to
write their own custom code, often resorting to assembly.  This is
because the data set is so much larger than the caches that saving even
a few instructions per item pays off, given how many items there are.

My advice would be to not worry about collisions, premature
optimization, etc. initially.  Get the code working and get it working
right.  Then do profiling and improve the algorithms.  You will have
test code from earlier, so you can verify that the optimizations are
correct.  Finally, you may have to rewrite core parts in assembly, but
again the tests will be there to help.

Roger