* The bottom line: If we are going to use n-gram instead of unigram
* tokens, we have to do something to keep the size of the database
* manageable. I'm not sure how it would be possible to purge the database
* of useless tokens when further learning may prove them to be useful.
* Using n-gram tokens is not an easy problem.
Suggestion... Don't store very many tokens per message. But, you say, how would we still get matches? You can still get matches if you choose the same restricted set of tokens every time, even when the messages are different. Here's how. Using a large hash (MD5?), do a numerical sort of all the n-gram tokens generated from a message. Pick the top 50 (or 25, or 200). Chop off the most significant bits of those hashes (they're almost all 1's anyway) and store the 50 (or 25, or 200) remainders.

What this does is pseudo-randomly elevate some n-grams over others as more interesting. Since it does this the same way for all messages, the same ones will be elevated both in the database and in the messages being checked against the database.

I have a patent on a related technique and may wish to patent this one too, but if I do, I'll certainly allow its unrestricted use in free software. I don't know for sure that this technique works for this application, but I suspect it does.

-Dave
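[A minimal sketch of the selection step in Python, assuming whitespace-delimited word n-grams and a 64-bit stored remainder; the function name, the trigram default, and both size parameters are illustrative choices, not from the post.]

    import hashlib

    def select_tokens(message, n=3, keep=50, low_bits=64):
        # Break the message into word n-grams (naive whitespace words
        # here; a real tokenizer would be more careful).
        words = message.split()
        ngrams = {" ".join(words[i:i + n])
                  for i in range(len(words) - n + 1)}
        # Hash every n-gram and sort the hash values numerically.
        hashes = sorted(
            int.from_bytes(hashlib.md5(g.encode()).digest(), "big")
            for g in ngrams
        )
        # Keep only the largest `keep` values. The selection depends
        # only on the hash of each n-gram, so two messages that share
        # n-grams tend to elect the same survivors.
        top = hashes[-keep:]
        # The survivors all sit near the top of the hash range, so
        # their high bits carry little information; store just the
        # low-order remainder of each.
        return [h & ((1 << low_bits) - 1) for h in top]

Because the same hash-and-sort runs on both the stored messages and the message being checked, any n-gram the two share lands in both top-50 lists whenever it ranks high enough, so matches survive even though most tokens are never stored.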
