I don't think you'd need to worry about incorrectly tokenizing each word.
In your example it doesn't matter if McDonald gets (incorrectly?) split into Mc and Donald, since it is up to the Bayesian analysis to detect whether the combination of Mc and Donald indicates spam. The point is that at the moment I don't believe the filter ever gets the chance to make that decision, since all it sees is McDonalds as a single token.
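
To make this concrete, here is a rough Python sketch of the behaviour I have in mind; the function name and details are mine, not anything from the existing filter code. The idea is to emit the case-boundary sub-words alongside the original token and let the Bayesian stage weigh both forms:

    import re

    def split_case_boundaries(token):
        # Break a token at lower-to-upper case boundaries,
        # e.g. 'McDonalds' -> ['Mc', 'Donalds'], keeping the
        # original token so the Bayesian stage sees both forms.
        parts = re.findall(r'[A-Z]+[a-z0-9]*|[a-z0-9]+', token)
        tokens = [token]
        if len(parts) > 1:
            tokens.extend(parts)
        return tokens

    print(split_case_boundaries('McDonalds'))  # ['McDonalds', 'Mc', 'Donalds']
    print(split_case_boundaries('viagra'))     # ['viagra']

Even if a split turns out to be "wrong", the worst case is a couple of extra tokens whose spam probabilities hover around neutral.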
David

Tom Brown wrote:
It seems like you wouldn't need to run a dictionary past each token, just make sure the split happens after the 3rd character (so McDonalds or MacDonald doesn't get split, but the majority of the SPAM is correctly tokenized).
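
For what it's worth, here is a quick Python sketch of that rule as I read it (the function is hypothetical, not from any existing code): only honour a case boundary that falls past the 3rd character, so McDonalds and MacDonald stay whole while longer run-together words still get broken up.

    import re

    def split_after_third(token):
        # Split at lower-to-upper case boundaries, but only at
        # positions past the 3rd character, so short prefixes
        # like 'Mc' and 'Mac' never trigger a split.
        pieces = []
        start = 0
        for m in re.finditer(r'(?<=[a-z])(?=[A-Z])', token):
            if m.start() > 3:
                pieces.append(token[start:m.start()])
                start = m.start()
        pieces.append(token[start:])
        return pieces

    print(split_after_third('McDonalds'))      # ['McDonalds']
    print(split_after_third('MacDonald'))      # ['MacDonald']
    print(split_after_third('FreeViagraNow'))  # ['Free', 'Viagra', 'Now']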
