Hash Code in Object is limited to an int and a quick look at HashMap and Trove's HashMap looks like they are only using 31 bits of that. I am now trying a modified version of what Ted pointed at and it seems to be working very well. I modified the original since only the last few bytes in the key are usually unique so I start at both ends when creating the Hash Code. So far I am half way through an import and the times went down from 24 seconds to 11 seconds on the first few files and have been holding around 13 seconds at half way vs 45 seconds with the old method.
Thanks -Pete -----Original Message----- From: Christopher Tarnas [mailto:[email protected]] On Behalf Of Chris Tarnas Sent: Thursday, March 17, 2011 9:21 AM To: [email protected] Subject: Re: OT - Hash Code Creation With 24 million elements you'd probably want a 64bit hash to minimize the risk of collision, the rule of thumb is with 64bit hash key expect a collision when you reach about 2^32 elements in your set. I half of a 128bit MD5 sum (a cryptographic hash so you can only use parts of it if you want) as that is readily available in our systems and so far has not been a bottleneck. I believe there is now a 64bit murmurhash, that would be faster to compute and ideal for what you want. I've been using base-64 encoding when I use my hashes as rowkeys - makes them printable while still being fairly dense, IIRC a 64bit key should be only 11 characters. -chris On Mar 17, 2011, at 12:30 AM, Pete Haidinyak wrote: > Thanks, I'll give that a try. > > -Pete > > On Thu, 17 Mar 2011 00:23:00 -0700, Ted Dunning <[email protected]> wrote: > >> Double hashing is a find thing. To actually answer the question, though, I >> would recommend Murmurhash or JOAAT ( >> http://en.wikipedia.org/wiki/Jenkins_hash_function) >> >> On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <[email protected]> wrote: >> >>> Try hash table with double hashing. >>> Something like this >>> >>> http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm >>> >>> 2011/3/17 Peter Haidinyak <[email protected]> >>> >>> > Hi, >>> > This is a little off topic but this group seems pretty swift so I >>> > thought I would ask. I am aggregating a day's worth of log data which >>> means >>> > I have a Map of over 24 million elements. What would be a good algorithm >>> to >>> > use for generating Hash Codes for these elements that cut down on >>> > collisions? I application starts out reading in a log (144 logs in all) >>> in >>> > about 20 seconds and by the time I reach the last log it is taking around >>> > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions. >>> > I've played around with different Hashing algorithms and cut the original >>> > time from over 300 seconds to 120 but I know I can do better. >>> > The key I am using for the Map is an alpha-numeric string that is >>> > approximately 16 character long with the last 4 or 5 character being the >>> > most unique. >>> > >>> > Any ideas? >>> > >>> > Thanks >>> > >>> > -Pete >>> > >
