Scott A Crosby wrote: [... snip really neat Sean Quinlan trick ...]
I like that! But thinking about it, I realize that much of the purpose of the trick is to optimize a case where collisions have to be accounted for, as in archival storage and retrieval (Venti).
We can get away with "lossy" behavior, as long as the statistics are "good enough". In other words, we can decide what probability of collision is acceptable, set the hash bit sizes accordingly, and then stop worrying about collisions.
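To make that trade-off concrete, here is a rough sketch (Python, purely illustrative, not SpamAssassin code) of the usual birthday-bound estimate: given how many distinct tokens we expect to carry, pick the hash width so that the collision probability stays under whatever threshold we decide is acceptable.

    # Birthday-bound estimate of token-hash collisions: with n distinct
    # tokens hashed into b-bit values, P(at least one collision) is
    # roughly 1 - exp(-n*(n-1) / 2^(b+1)).  Illustrative only.
    import math

    def collision_probability(n_tokens, hash_bits):
        """Approximate probability that at least one collision occurs."""
        pairs = n_tokens * (n_tokens - 1) / 2.0
        return 1.0 - math.exp(-pairs / 2.0 ** hash_bits)

    # e.g. a million distinct tokens at various hash widths
    for bits in (32, 40, 48):
        p = collision_probability(1000000, bits)
        print("%d-bit hashes, 1e6 tokens: P(collision) ~ %.4f" % (bits, p))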
I was going to go on about the probability of collisions, but something just occurred to me:
The bayes database is being used in two different ways. When we are calculating the spam probability, we are only interested in the most significant 15 tokens in the message, and we are not writing to the database. Is there any reason to keep the majority of tokens, which are rare, available in the database for that purpose? Why do we need a database with millions, or even hundreds of thousands, of tokens at all except when we are performing a learn operation?
Can we run sa-learn in a batch mode and have it generate small databases that are used by the Bayes rules?
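As a rough illustration of what such a batch step might do (the field names, thresholds, and data layout below are made up, not the real bayes store format), the idea is a pass over the full learn-time database that keeps only tokens that could ever rank among the significant ones at scoring time: tokens seen often enough, with spamminess far from the neutral 0.5.

    # Hypothetical pruning pass: read the full learn-time token counts and
    # emit a small read-only table for scoring.  Not SpamAssassin code;
    # the thresholds and the naive probability estimate are illustrative.

    def prune_tokens(full_db, min_count=5, min_distance=0.1):
        """Keep tokens seen at least min_count times whose spam probability
        is at least min_distance away from the neutral 0.5."""
        small_db = {}
        for token, (spam_count, ham_count) in full_db.items():
            total = spam_count + ham_count
            if total < min_count:
                continue
            prob = float(spam_count) / total   # naive spamminess estimate
            if abs(prob - 0.5) >= min_distance:
                small_db[token] = prob
        return small_db

    # Toy "full" database of (spam_count, ham_count) pairs
    full = {"viagra": (120, 1), "meeting": (2, 90),
            "the": (500, 480), "xyzzy": (1, 0)}
    print(prune_tokens(full))   # drops the rare and the neutral tokens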
-- sidney
