On Tue, 09 Mar 2004 17:13:24 +1300, Sidney Markowitz <[EMAIL PROTECTED]> writes:
> Scott A Crosby wrote:
> > Using a trick from Quinlan
>
> Sean Quinlan? Venti? Or are you talking about something that Daniel
> Quinlan has done here? If the former, I still don't see how you get
> 2^i tokens in the database with an i bit key from a hash. As you
> said, the Birthday Paradox says you start getting collisions after
> 2^(i/2) tokens.

His insight was that the location in the hash table contributes
information that is redundant. Implementing it, here's my algorithm.

Define three keyed hash functions:

  H1(x) -> maps a string to a *page*
  H2(x) -> maps a string to an offset on a page
  H3(x) -> maps a string to a hash value

I divide the hash table into 'pages', each consisting of 256 slots for
256 keys. That means H1 returns log2(table size) - 8 bits. Since a key
can ONLY be stored on the page returned by H1, those hash bits are
*implicit* and don't need to be stored.

On each page, I use H2 to choose the offset at which to look for the
key, probably with quadratic probing; if I reach the end of a page, I
wrap around to the beginning. To disambiguate keys that land on the
same page, each slot stores the output of H3 explicitly, and that is
what a search compares against. Thus H3 supplies the only bits that
need explicit storage.

With 1 million keys, H1 returns one of 4k pages, H3 returns 28 bits,
10 bits each for the two counters and 16 bits for a circular last-use
time give 64 bits per slot, and, because of H1, an effective key size
of 12+28=40 bits. If the hash table grows in size, H1 simply gets more
bits. (A rough sketch of this slot layout and lookup follows below.)

It's minimalistic, perhaps too much so --- there's no way to rehash or
change the table size once it is built --- but 8 bytes/key is hard to
beat. I'm sure you can see other variants and different design
tradeoffs with this approach.

If you're willing to spend 11 bytes/key, I'd reject this whole
'implicit bits' and page approach and store a 48-bit hash key, 2*12-bit
counts, and a 16-bit LRU. That way you can resize and rehash at any
time.

> We can define what is a reasonable maximum number of tokens (allowing
> for eventually having multiword tokens) and how many collisions are
> tolerable, and set the number of bits based on that. But I don't think
> we would need more than 40 bits of hash.

Agreed, except I'd go to 48 bits for robustness. 1 million keys is
possible, even likely in some situations; 16 million, before the first
birthday collision, is much less so.

Scott
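
[Editorial sketch] For concreteness, here is a rough, untested C sketch of
the paged layout described above: 256-slot pages, a 28-bit stored check
hash, two 10-bit counters, and a 16-bit last-use stamp packed into 8 bytes
per slot. The seeded FNV-1a hash, the triangular-number probe sequence,
and the spam/ham field names are placeholder assumptions, not the actual
implementation; uint64_t bit-fields are a GCC/Clang convenience.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SLOTS_PER_PAGE 256                 /* H2 supplies exactly 8 bits        */
#define CHECK_BITS     28                  /* H3 bits stored explicitly         */

/* One 8-byte slot: 28 + 10 + 10 + 16 = 64 bits.  Relies on the compiler
 * packing uint64_t bit-fields into a single 64-bit word (GCC/Clang do). */
typedef struct {
    uint64_t check : CHECK_BITS;           /* H3(key): disambiguates in a page  */
    uint64_t spam  : 10;                   /* first counter                     */
    uint64_t ham   : 10;                   /* second counter                    */
    uint64_t lru   : 16;                   /* circular last-use time            */
} slot_t;

typedef struct {
    slot_t  *slots;                        /* num_pages * SLOTS_PER_PAGE slots  */
    uint32_t num_pages;                    /* e.g. 4096 pages for ~1M keys      */
} table_t;

/* Stand-in keyed hash: FNV-1a with a per-function seed mixed in. */
static uint64_t keyed_hash(const char *s, uint64_t seed)
{
    uint64_t h = 14695981039346656037ULL ^ seed;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
    return h;
}

/* H1: page number -- these bits are never stored, they are implicit. */
static uint32_t H1(const table_t *t, const char *s)
{ return (uint32_t)(keyed_hash(s, 0x9e3779b97f4a7c15ULL) % t->num_pages); }

/* H2: starting offset within the page. */
static uint32_t H2(const char *s)
{ return (uint32_t)(keyed_hash(s, 0xc2b2ae3d27d4eb4fULL) & (SLOTS_PER_PAGE - 1)); }

/* H3: the 28 check bits that each slot stores explicitly. */
static uint64_t H3(const char *s)
{ return keyed_hash(s, 0x165667b19e3779f9ULL) & ((1ULL << CHECK_BITS) - 1); }

static table_t table_new(uint32_t num_pages)
{
    table_t t;
    t.num_pages = num_pages;
    t.slots = calloc((size_t)num_pages * SLOTS_PER_PAGE, sizeof(slot_t));
    return t;
}

/* Find the slot for a key, optionally claiming an empty one.  Probes
 * quadratically via triangular numbers, which visit every offset of a
 * power-of-two page; an all-zero slot is treated as empty. */
static slot_t *table_find(table_t *t, const char *key, int insert_if_missing)
{
    slot_t  *page  = t->slots + (uint64_t)H1(t, key) * SLOTS_PER_PAGE;
    uint32_t start = H2(key);
    uint64_t check = H3(key);

    for (uint32_t i = 0; i < SLOTS_PER_PAGE; i++) {
        slot_t *s = &page[(start + i * (i + 1) / 2) & (SLOTS_PER_PAGE - 1)];
        int empty = (s->check == 0 && s->spam == 0 && s->ham == 0);
        if (!empty && s->check == check)
            return s;                      /* found (modulo H3 collisions)      */
        if (empty) {
            if (!insert_if_missing)
                return NULL;
            s->check = check;              /* claim the slot                    */
            return s;
        }
    }
    return NULL;                           /* page full: no rehash possible     */
}

int main(void)
{
    table_t t = table_new(4096);           /* ~1M slots, as in the 1M-key case  */
    if (!t.slots) return 1;
    slot_t *s = table_find(&t, "viagra", 1);
    if (s) {
        s->spam++;
        printf("check=%07llx spam=%u\n",
               (unsigned long long)s->check, (unsigned)s->spam);
    }
    free(t.slots);
    return 0;
}

The 11-bytes/key alternative mentioned above would drop the page trick
entirely: a flat open-addressed table of {48-bit hash key, 2*12-bit
counts, 16-bit LRU} records, which can be resized and rehashed at will.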
