Theo Van Dinter <[EMAIL PROTECTED]> writes: > Using the hash results, CRC64 does seem like a good choice. It's the > fastest algorithm that has 0 collisions in a decently large db. CRC32 > is pretty decent, but I'd rather get rid of collisions than save the > ~0.1 seconds. CRC16 and the built-in perl checksum are, as expected, > abysmal in terms of collisions. MD5 and SHA1 don't seem to get you > anything in this case.
The 14 collisions for CRC32 is pretty much negligible. It would probably be lower in actuality given that your Bayes DB has truncated tokens already. We could make the hashing function a configuration option (please no autodetection, though). Just the same, SHA1 wasn't too bad. The extra time for even a SHA1 is perhaps negligible. I suspect if you used the first 32 bits or first 64 bits of the SHA1 you'd get equally good (perhaps better vs. CRC32) collision rates with the same size. > Overall, since the I/O time is the same for hash vs non-hash, I don't > see a worthwhile benefit to using hashes. Of course, I was doing a > fairly non-scientific test -- modifying the Bayes code/using > DB_File/etc may produce different results. I disagree. I believe using a fixed length key would enable faster and much more space efficient DB hashing when using a DB capable of using that to its advantage. Probably not with DB_File, of course, but other DBs have options for fixed length keys, and we could even use a custom DB. Daniel -- Daniel Quinlan anti-spam (SpamAssassin), Linux, http://www.pathname.com/~quinlan/ and open source consulting
