Looking over the DSPAM docs recently, I saw that it converts every token to an 8-byte integer hash using CRC64 and from then on works only with those fixed-length 8-byte numbers instead of variable-length strings. CRC64 will not yield perfectly unique results, but collisions should be rare enough not to matter for the Bayes statistics.
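To make the idea concrete, here is a rough Python sketch of hashing a token down to a fixed 8-byte key, plus a birthday-bound estimate of how often two distinct tokens would collide. This is not DSPAM's code: the polynomial is the CRC-64/ECMA-182 one, chosen as an assumption because I have not checked which CRC64 parameters DSPAM uses, and the names `crc64`, `token_key` and `collision_probability` are made up for illustration. The collision estimate also assumes an ideal uniform 64-bit hash, which a CRC only approximates.

```python
import math
import struct

# Assumption: CRC-64/ECMA-182 polynomial. DSPAM's actual CRC64 parameters
# may differ; this only illustrates the token -> 8-byte integer idea.
POLY = 0x42F0E1EBA9EA3693
MASK = 0xFFFFFFFFFFFFFFFF


def crc64(data: bytes) -> int:
    """Bitwise, MSB-first CRC-64 with initial value 0 and no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 56
        for _ in range(8):
            if crc & (1 << 63):
                crc = ((crc << 1) ^ POLY) & MASK
            else:
                crc = (crc << 1) & MASK
    return crc


def token_key(token: str) -> bytes:
    """Hypothetical helper: the fixed-length 8-byte database key for a token."""
    return struct.pack(">Q", crc64(token.encode("utf-8")))


def collision_probability(n_tokens: int, bits: int = 64) -> float:
    """Birthday-bound chance that at least two of n_tokens distinct tokens
    share a hash, assuming an ideal uniform hash of the given width."""
    x = n_tokens * (n_tokens - 1) / 2 / 2**bits
    return -math.expm1(-x)


if __name__ == "__main__":
    print(token_key("unsubscribe").hex())
    print(collision_probability(1_000_000))
```

Under those assumptions, a database of a million distinct tokens has a chance on the order of 10^-8 of containing even one colliding pair, which is why I think the loss of uniqueness is harmless here.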

Does anyone here have enough experience with SpamAssassin's Bayes processing to guess how much of a difference it would make, if any, if the Bayes db stored fixed-length 8-byte integers instead of strings, so that all comparisons were between 8-byte integers? How much would that change storage requirements? How would it change the I/O needed to read from and write to the database?
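For the storage half of the question, a back-of-the-envelope comparison of key sizes could look like the sketch below. It counts only the raw key bytes and ignores whatever per-record overhead the database backend adds, so it is at best a lower bound on the real difference. The sample tokens and the function names are invented for illustration, not taken from a real Bayes db.

```python
def key_bytes_as_strings(tokens):
    """Total key bytes if each token is stored as its UTF-8 string."""
    return sum(len(t.encode("utf-8")) for t in tokens)


def key_bytes_as_hashes(tokens, hash_size=8):
    """Total key bytes if each token is stored as a fixed-size hash."""
    return hash_size * len(tokens)


if __name__ == "__main__":
    # Made-up sample; a real measurement would iterate over an actual db dump.
    sample = ["viagra", "unsubscribe", "meeting", "http://example.com/offer", "a"]
    print("as strings:", key_bytes_as_strings(sample))
    print("as hashes: ", key_bytes_as_hashes(sample))
```

Note that tokens shorter than 8 bytes get larger when hashed, so the saving depends entirely on the length distribution of the tokens actually stored.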

I'm not putting this in Bugzilla as an RFE yet, because I would first like to get a sense of whether it seems worth pursuing.

 -- sidney


