* The bottom line: If we are going to use n-gram instead of unigram 
* tokens, we have to do something to keep the size of the database 
* manageable. I'm not sure how it would be possible to purge the database 
* of useless tokens when further learning may prove them to be useful. 
* Using n-gram tokens is not an easy problem.

Suggestion...

Don't store very many tokens per message.  But, you say, how would
we still get matches?

You can still get matches if you choose the restricted set of tokens
the same deterministic way every time, even when the messages are
different.

Here's how.

Use a large hash (md5?) on every n-gram token generated from a
message, then do a numerical sort of the hash values.  Pick the top
50 (or 25, or 200).  Chop off the most significant bits from each
hash (they're all 1's anyway, since you kept the largest values).
Store those 50 (or 25, or 200) remainders.
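
A minimal sketch of that selection in Python; the word-trigram
tokenizer, md5, the 50-token cutoff, and the number of chopped bits
are my own illustrative assumptions, not part of the proposal:

    import hashlib

    def select_tokens(text, n=3, keep=50, chop_bits=8):
        # Split the message into word n-grams (trigrams assumed here).
        words = text.split()
        ngrams = {" ".join(words[i:i + n])
                  for i in range(len(words) - n + 1)}
        # Hash each n-gram to a 128-bit integer and sort numerically.
        hashes = sorted(
            (int.from_bytes(hashlib.md5(g.encode()).digest(), "big")
             for g in ngrams),
            reverse=True)
        # Keep only the top `keep` values; their high bits are nearly
        # all 1's, so mask them off and store the low-order remainders.
        mask = (1 << (128 - chop_bits)) - 1
        return {h & mask for h in hashes[:keep]}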

What this does is pseudo-randomly elevate some n-grams over others as
more interesting.  Since it does this the same way for all messages,
the same ones will be elevated both in the database and in the
messages being checked against the database.
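
For instance, under the sketch above, two messages that share most of
their text select mostly the same remainders, so a simple set
intersection gives the match (the sample messages are made up):

    a = select_tokens("buy cheap pills now limited time offer act fast")
    b = select_tokens("buy cheap pills now limited time offer act soon")
    print(len(a & b))  # large overlap => the messages look similar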

I have a patent on a related technique and may wish to patent this
one too, but if I do, I'll certainly allow its unrestricted use
in free software.

I don't know for sure that this technique works for this application,
but I suspect it does.

-Dave
