It's even more screwed up than I thought, and it's no wonder it did not work in SpamBayes. Their results do not reflect on our use of 40-bit hashes of single tokens.
Not only does CRM114 use a 32-bit hash, which you could expect to start producing collisions once the database passes about 2^16 unique tokens, with the number of collisions quadrupling every time the number of unique tokens doubles; not only does it generate 16 multigram tokens for every single token, which explodes the number of unique tokens in the database; but when it finally stores the frequency counts it squashes them into one-megabyte tables of one-byte unsigned integers. That ends up being a 20-bit hash with counts limited to 8 bits of precision!
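To put rough numbers on that, here is a quick Python sketch I wrote (not CRM114's actual code, just the back-of-the-envelope arithmetic) showing the expected collisions from the birthday bound and what squashing counts into a one-megabyte table of one-byte counters looks like:

    # Back-of-the-envelope arithmetic, not CRM114 code.
    def expected_collisions(n_tokens, hash_bits):
        # Birthday approximation: expected number of colliding pairs among
        # n_tokens values drawn uniformly from a space of 2**hash_bits.
        return n_tokens * (n_tokens - 1) / (2 * 2 ** hash_bits)

    for tokens in (2 ** 16, 2 ** 17, 2 ** 18, 2 ** 20):
        print(tokens, "tokens:",
              round(expected_collisions(tokens, 32), 2), "collisions at 32 bits,",
              round(expected_collisions(tokens, 20)), "at 20 bits (1 MB table),",
              round(expected_collisions(tokens, 40), 4), "at 40 bits")

    # And the storage squash: only 2**20 buckets, each a saturating 8-bit count.
    table = bytearray(2 ** 20)          # the one-megabyte table
    def bump(token_hash):
        bucket = token_hash & 0xFFFFF   # throw away all but 20 bits of the hash
        table[bucket] = min(table[bucket] + 1, 255)   # counts cap at 255

At 2^16 unique tokens the 32-bit hash is only just starting to collide, but the 20-bit table is already a couple of thousand collisions deep, and a 40-bit hash has essentially none; every doubling of the token count quadruples all of those numbers.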
So how does CRM114 achieve great results? Maybe by being trained on the author's personal mail and tested on the author's personal mail :-(
He cautions against "overtraining" CRM114. I think that means you have to give it very little data so the 20-bit hash bucket space doesn't fill up, which incidentally also keeps the 8-bit counters from overflowing too much.
I found an even more enlightening three-message thread on the SpamBayes-Dev mailing list. The first message is from someone suggesting CRM114-style tokenization and hashes; it pretty much parrots the CRM114 author's description of why what it does is good.
http://mail.python.org/pipermail/spambayes-dev/2003-July/000501.html
One reply briefly lists the disastrous results the SpamBayes developers had when they tried CRM114.
http://mail.python.org/pipermail/spambayes-dev/2003-July/000502.html
The other reply is all about how it failed to work when they tried it in SpamBayes, how the hash collisions make it perform terribly once it has learned past a certain amount, and how SpamBayes trained on more data performed better than CRM114 trained on the smaller amount of data it can handle.
http://mail.python.org/pipermail/spambayes-dev/2003-July/000505.html
The only advantage CRM114 has, according to that posting, is that it amplifies the amount of data extracted from a small training set (by using the multigram tokens) and so performs better on the small training sets it can handle. But that is still not as good as a unigram-token system trained on a large enough set.
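For what it's worth, here is my own rough sketch of how that multigram amplification works; the exact window size and skip encoding CRM114 uses may differ, but the idea is that each token position in the message yields 16 features instead of one:

    from itertools import product

    # Sliding-window multigram features: combine the current token with every
    # subset of the preceding window-1 tokens, marking skipped slots.
    # A window of 5 gives 2**4 = 16 features per token position.
    def multigrams(tokens, window=5):
        feats = []
        for i, tok in enumerate(tokens):
            context = tokens[max(0, i - window + 1):i]
            context = ["<pad>"] * (window - 1 - len(context)) + context
            for mask in product((0, 1), repeat=window - 1):
                parts = [c if keep else "<skip>" for c, keep in zip(context, mask)]
                feats.append(" ".join(parts + [tok]))
        return feats

    print(len(multigrams("the quick brown fox jumps over".split())))  # 6 tokens -> 96 features

Sixteen times the features from the same message is why it can wring more out of a tiny training set, and also why the database explodes.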
Here is something else interesting from that last post:
"I'll add that we've gotten worse results out of every attempt to use hashing to bound the database size, whether or not we also tried CRM114-style wildcard token generation."
That looks bad for using hashes at all, but I am certain that if he got those results they came from using a hash that was too small. I keep seeing people blindly use the Jenkins hash function because it is well respected, simple, and fast, without realizing that its 32-bit output may not be enough for their application. Given a hash large enough relative to the database size that there are no collisions at all, it would be impossible to get results any different from not using a hash. So I conclude that their poor experience was the result of too small a hash.
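If anyone wants to check that claim against real data, it is easy enough to test empirically: hash every distinct token in the database at the width you plan to use and look for two different tokens landing on the same value. The helper below is just an illustration; I truncate MD5 as a stand-in for whatever n-bit hash we actually end up using.

    import hashlib

    def hash_n_bits(token, bits=40):
        # Stand-in n-bit hash: the top `bits` bits of MD5. Any decent hash
        # works; the point is the output width, not the particular function.
        return int.from_bytes(hashlib.md5(token.encode()).digest(), "big") >> (128 - bits)

    def find_collisions(tokens, bits=40):
        seen = {}
        collisions = []
        for tok in tokens:
            h = hash_n_bits(tok, bits)
            if h in seen and seen[h] != tok:
                collisions.append((seen[h], tok))
            else:
                seen[h] = tok
        return collisions

    tokens = ["token%d" % i for i in range(300000)]
    print(len(find_collisions(tokens, bits=40)))   # almost certainly 0
    print(len(find_collisions(tokens, bits=20)))   # tens of thousands

If the collision list comes back empty, the hashed database cannot behave any differently from the unhashed one.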
The bottom line: If we are going to use n-gram instead of unigram tokens, we have to do something to keep the size of the database manageable. I'm not sure how it would be possible to purge the database of useless tokens when further learning may prove them to be useful. Using n-gram tokens is not an easy problem.
-- sidney
