It seems like you wouldn't need to run a dictionary check against each token; just make sure a split only happens after the 3rd character, so McDonalds or MacDonald doesn't get split but the majority of the spam still gets tokenized correctly. Something like the rough sketch below:
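
For what it's worth, here's a rough, untested sketch of the idea in plain
Java (nothing James-specific; the class name and the 3-character threshold
are just for illustration):

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of James: splits a token at
// lower-to-upper case changes, but only when the piece built so far
// is longer than 3 characters, so "McDonalds" and "MacDonald"
// survive intact.
public class CaseSplitter {

    public static List<String> split(String token) {
        List<String> parts = new ArrayList<String>();
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            boolean caseChange = Character.isLowerCase(token.charAt(i - 1))
                    && Character.isUpperCase(token.charAt(i));
            // Require more than 3 characters before the boundary, so the
            // "Mc" in McDonalds (2 chars) and the "Mac" in MacDonald
            // (3 chars) are never cut off.
            if (caseChange && (i - start) > 3) {
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        parts.add(token.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        // Prints [Concatenate, TheWords, Together, Like, This] -- note
        // that three-letter words like "The" stay glued to the next
        // word under this rule, which still gives the filter mostly
        // real words to score instead of one giant token.
        System.out.println(split("ConcatenateTheWordsTogetherLikeThis"));
        // Prints [McDonalds] and [MacDonald] -- no split.
        System.out.println(split("McDonalds"));
        System.out.println(split("MacDonald"));
    }
}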
Perhaps this technique wouldn't work, but it sounds promising to me...

Tom Brown

On Nov 25, 2007 9:21 AM, David Legg <[EMAIL PROTECTED]> wrote:
> In the past I've reported how effective I've found the Bayesian analysis
> filter supplied with James.
>
> I still find it incredibly effective (roughly 97% of all spam is
> rejected). I just thought I'd mention an increasingly common technique
> I've noticed over the past couple of months which appears to reduce its
> effectiveness. The spammers appear to be producing very short messages
> (no more than two lines) and they ConcatenateTheWordsTogetherLikeThis.
>
> The filter sees this as one big token which it has never seen before,
> and therefore its effectiveness is reduced. Add to this the spammers'
> seemingly never-ending arsenal of domains, and the filter stands no
> chance.
>
> The only solution I can think of is some code which tries to break long
> tokens apart. A simple technique would be to break tokens up at changes
> of case, so ConcatenateTheWordsTogetherLikeThis would become
> Concatenate^The^Words^Together^Like^This. The ultimate technique would
> be to run each token against a dictionary, but I think that would be
> too costly.
>
> David
