With 24 million elements you'd probably want a 64bit hash to minimize the risk of collision, the rule of thumb is with 64bit hash key expect a collision when you reach about 2^32 elements in your set. I half of a 128bit MD5 sum (a cryptographic hash so you can only use parts of it if you want) as that is readily available in our systems and so far has not been a bottleneck. I believe there is now a 64bit murmurhash, that would be faster to compute and ideal for what you want. I've been using base-64 encoding when I use my hashes as rowkeys - makes them printable while still being fairly dense, IIRC a 64bit key should be only 11 characters.
-chris On Mar 17, 2011, at 12:30 AM, Pete Haidinyak wrote: > Thanks, I'll give that a try. > > -Pete > > On Thu, 17 Mar 2011 00:23:00 -0700, Ted Dunning <tdunn...@maprtech.com> wrote: > >> Double hashing is a find thing. To actually answer the question, though, I >> would recommend Murmurhash or JOAAT ( >> http://en.wikipedia.org/wiki/Jenkins_hash_function) >> >> On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <oct...@gmail.com> wrote: >> >>> Try hash table with double hashing. >>> Something like this >>> >>> http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm >>> >>> 2011/3/17 Peter Haidinyak <phaidin...@local.com> >>> >>> > Hi, >>> > This is a little off topic but this group seems pretty swift so I >>> > thought I would ask. I am aggregating a day's worth of log data which >>> means >>> > I have a Map of over 24 million elements. What would be a good algorithm >>> to >>> > use for generating Hash Codes for these elements that cut down on >>> > collisions? I application starts out reading in a log (144 logs in all) >>> in >>> > about 20 seconds and by the time I reach the last log it is taking around >>> > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions. >>> > I've played around with different Hashing algorithms and cut the original >>> > time from over 300 seconds to 120 but I know I can do better. >>> > The key I am using for the Map is an alpha-numeric string that is >>> > approximately 16 character long with the last 4 or 5 character being the >>> > most unique. >>> > >>> > Any ideas? >>> > >>> > Thanks >>> > >>> > -Pete >>> > >