I don't think you'd need to worry about incorrectly tokenizing each word.
In your example it doesn't matter if McDonald gets (incorrectly?) split into Mc and Donald, since it is up to the Bayesian analysis to detect whether the combination of Mc and Donald indicates spam. The point is that at the moment I don't believe the filter ever gets the chance to make that decision, since all it sees is McDonalds as a single token.
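
To make this concrete, here is a rough Python sketch of the behaviour I have in mind; the function name and details are mine, not anything from the existing filter code. The idea is to emit the case-boundary sub-words alongside the original token and let the Bayesian stage weigh both forms:

    import re

    def split_case_boundaries(token):
        # Break a token at lower-to-upper case boundaries,
        # e.g. 'McDonalds' -> ['Mc', 'Donalds'], keeping the
        # original token so the Bayesian stage sees both forms.
        parts = re.findall(r'[A-Z]+[a-z0-9]*|[a-z0-9]+', token)
        tokens = [token]
        if len(parts) > 1:
            tokens.extend(parts)
        return tokens

    print(split_case_boundaries('McDonalds'))  # ['McDonalds', 'Mc', 'Donalds']
    print(split_case_boundaries('viagra'))     # ['viagra']

Even if a split turns out to be "wrong", the worst case is a couple of extra tokens whose spam probabilities hover around neutral.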
David

Tom Brown wrote:
It seems like you wouldn't need to run a dictionary past each token, just make sure the split happens after the 3rd character (so McDonalds or MacDonald doesn't get split, but the majority of the SPAM is correctly tokenized).
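
For what it's worth, here is a quick Python sketch of that rule as I read it (the function is hypothetical, not from any existing code): only honour a case boundary that falls past the 3rd character, so McDonalds and MacDonald stay whole while longer run-together words still get broken up.

    import re

    def split_after_third(token):
        # Split at lower-to-upper case boundaries, but only at
        # positions past the 3rd character, so short prefixes
        # like 'Mc' and 'Mac' never trigger a split.
        pieces = []
        start = 0
        for m in re.finditer(r'(?<=[a-z])(?=[A-Z])', token):
            if m.start() > 3:
                pieces.append(token[start:m.start()])
                start = m.start()
        pieces.append(token[start:])
        return pieces

    print(split_after_third('McDonalds'))      # ['McDonalds']
    print(split_after_third('MacDonald'))      # ['MacDonald']
    print(split_after_third('FreeViagraNow'))  # ['Free', 'Viagra', 'Now']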
