Re: Bayes optimization?

Daniel Quinlan 6 Mar 2004 10:10:46 -0000

Theo Van Dinter <[EMAIL PROTECTED]> writes:

> Using the hash results, CRC64 does seem like a good choice.  It's the
> fastest algorithm that has 0 collisions in a decently large db.  CRC32
> is pretty decent, but I'd rather get rid of collisions than save the
> ~0.1 seconds.  CRC16 and the built-in perl checksum are, as expected,
> abysmal in terms of collisions.  MD5 and SHA1 don't seem to get you
> anything in this case.


The 14 collisions for CRC32 is pretty much negligible.  It would
probably be lower in actuality given that your Bayes DB has truncated
tokens already.

We could make the hashing function a configuration option (please no
autodetection, though).

Just the same, SHA1 wasn't too bad.  The extra time for even a SHA1 is
perhaps negligible.  I suspect if you used the first 32 bits or first 64
bits of the SHA1 you'd get equally good (perhaps better vs. CRC32)
collision rates with the same size.
 
> Overall, since the I/O time is the same for hash vs non-hash, I don't
> see a worthwhile benefit to using hashes.  Of course, I was doing a
> fairly non-scientific test -- modifying the Bayes code/using
> DB_File/etc may produce different results.

I disagree.  I believe using a fixed length key would enable faster and
much more space efficient DB hashing when using a DB capable of using
that to its advantage.  Probably not with DB_File, of course, but other
DBs have options for fixed length keys, and we could even use a custom
DB.
 
Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting

Re: Bayes optimization?

Reply via email to