On Fri, Mar 05, 2004 at 07:36:02PM -0500, Theo Van Dinter wrote:
> Definitely, but we either need to test this RSN or punt to 3.1, IMO.

BTW: I would recommend that for a CRC64 implementation, unless someone finds something better, we use String::CRC. It's public-domain code, already on CPAN, and has an XS to do the calculations. Heck, since it's public domain, we could (I believe) include it directly in the SA distro.

Just for kicks I did some testing against my 433k bayes tokens -- specifically a very simple "perl -nl" run reading in a file w/ 1 token per line, and doing the appropriate hash function call. Times are for simple read-in (run 3 times and averaged); the collision check was run separately to avoid any time penalties.

Algorithm             Time (s)  Collisions  Token Size (bytes)
CRC64                 0.66      0           8
CRC32                 0.57      14          4
CRC16                 0.57      358k        2
16-bit perl checksum  0.70      429k        2
MD5                   0.98      0           16
SHA1                  1.02      0           20
None                  0.32      0           ~12

So we see that it takes ~0.3s just to read in the tokens, which average 12 bytes each. I got the same result reading in forced 8 byte tokens. This is avoiding DB_File and anything else, just standard file I/O, 1 line/token at a time.

Using the hash results, CRC64 does seem like a good choice. It's the fastest algorithm that has 0 collisions in a decently large db. CRC32 is pretty decent, but I'd rather get rid of collisions than save the ~0.1 seconds. CRC16 and the built-in perl checksum are, as expected, abysmal in terms of collisions. MD5 and SHA1 don't seem to get you anything in this case.

Overall, since the I/O time is the same for hash vs non-hash, I don't see a worthwhile benefit to using hashes. Of course, I was doing a fairly non-scientific test -- modifying the Bayes code/using DB_File/etc may produce different results. I would think Bayes in SQL would perhaps benefit from the hashing, since the SQL databases I've seen usually tell you right out that CHAR(8) is better than VARCHAR(8), aka: fixed size is better than variable sized.
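
For anyone who wants to repeat the collision check on their own token file, here is a minimal sketch of one way to do it. It is written in Python rather than the "perl -nl" one-liner used for the numbers above, and it is not the original test script. The function names (checksum16, crc32_hash, md5_hash, count_collisions, expected_collisions) are hypothetical, the 16-bit checksum is assumed to be a plain byte sum modulo 2**16, and "collision" is assumed to mean a distinct token whose hash value is already taken by another token. CRC64 is left out because the Python standard library has no CRC64 routine.

```python
import hashlib
import zlib


def checksum16(token: bytes) -> int:
    # Assumed definition: sum of the byte values, kept to 16 bits.
    return sum(token) & 0xFFFF


def crc32_hash(token: bytes) -> int:
    # zlib's CRC32, masked to an unsigned 32-bit value.
    return zlib.crc32(token) & 0xFFFFFFFF


def md5_hash(token: bytes) -> bytes:
    # Full 16-byte MD5 digest.
    return hashlib.md5(token).digest()


def count_collisions(tokens, hash_func) -> int:
    # Distinct tokens minus distinct hash values: how many tokens
    # landed on a hash value that another token already occupies.
    distinct_tokens = set(tokens)
    distinct_hashes = {hash_func(t) for t in distinct_tokens}
    return len(distinct_tokens) - len(distinct_hashes)


def expected_collisions(n: int, bits: int) -> float:
    # Birthday-style estimate of colliding pairs for n distinct
    # tokens under an ideal, uniformly distributed hash of this width.
    return n * (n - 1) / 2 / 2 ** bits


if __name__ == "__main__":
    sample = [b"ab", b"ba", b"cd"]  # toy data, not real bayes tokens
    for name, func in [("checksum16", checksum16),
                       ("crc32", crc32_hash),
                       ("md5", md5_hash)]:
        print(name, count_collisions(sample, func))
    # Rough estimates for about 433k tokens at 32 and 64 bits.
    print(expected_collisions(433_000, 32))
    print(expected_collisions(433_000, 64))
```

Under the ideal-hash assumption, the estimate for roughly 433k tokens is about 22 colliding pairs at 32 bits and effectively zero at 64 bits, the same order as the 14 and 0 in the table. The 16-bit figures are a different case: with only 65,536 possible values, most tokens must share a value no matter how good the hash is.
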
Randomly Generated Tagline: "There are all of these warnings and incantations and unnatural rituals and everything's veiled in this threat of "you mess with the mayo, the mayo mess with you, man." - Alton Brown, Good Eats, "Mayo Clinc"
