On Sun, Feb 22, 2004, Andrew Cowie wrote:
> My guess [this is entirely unscientific] is that it backfired on them.
> The dictionary is relatively big, but the set of words commonly used
> is *really* small in comparison. Because they use words that I and my
> correspondents *never* use, the score on uncommon words (take
> "lanthanide" and "dispensary". Who are they kidding?) goes up, and
> they become clear markers for spam.

I had a pretty similar experience: the "random word" spams were missed
initially, but the filter was trained on them successfully.

I've never looked at the particular implementation/application of
Bayes's theorem in SpamAssassin or bogofilter, but I did do an
implementation of Paul Graham's "A Plan for Spam" for an assignment.

The plan specifies that you take the X most indicative words in any
mail (i.e. the 15 words that are most polarised towards "ham" or
"spam") and combine their probabilities to get a probability that the
mail itself is spam or ham. The words in the middle ("could be spam,
could be ham...") don't count.

It seems likely to me that the uncommon words were simply not among
the most indicative words. The reason the "random word" spams were
getting past the filters was that they *also* lacked incredibly spammy
words. However, as we trained our filters, the words in them
(remember, headers included!) were learned as spammy.

As far as I can see, the major weakness in the Bayesian method is that
it is quite easy to work out which words are spammy -- I suspect my
"most spammy" list looks much like yours, and much like everyone
else's. A spammer can assemble a training corpus of spam as easily as
we can, work out which words really set a Bayesian filter on fire, and
avoid them. However, the Bayesian filters do get updated :)

The major strength is that the hammy words, especially the strongly
hammy ones, are pretty difficult to guess, because valid mail varies
by person (everyone gets pretty similar spam, but people get quite
different valid mail).

-Mary
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
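
For anyone who wants to poke at the "most indicative words" step described above, here is a minimal sketch in Python. It is not the code from my assignment, nor what SpamAssassin or bogofilter do. The function name spam_probability, the word_probs table and the example numbers are made up for illustration. The 0.4 default for never-seen words follows Graham's essay, and the table is assumed to hold per-word spam probabilities already clamped away from 0 and 1.

```python
from math import prod

UNKNOWN_WORD_PROB = 0.4  # default for never-seen words, as in Graham's essay
TOP_N = 15               # how many of the most polarised words to combine


def spam_probability(tokens, word_probs, top_n=TOP_N):
    """Combine the top_n most polarised per-word spam probabilities.

    word_probs is a hypothetical dict mapping word -> P(spam | word),
    assumed to be clamped to something like [0.01, 0.99] already.
    """
    # Look up each distinct token; unseen words get a mildly hammy default.
    probs = [word_probs.get(t, UNKNOWN_WORD_PROB) for t in set(tokens)]
    # "Most indicative" means farthest from the neutral 0.5, either way.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:top_n]
    spammy = prod(top)
    hammy = prod(1.0 - p for p in top)
    return spammy / (spammy + hammy)


# Made-up training results, purely for illustration.
word_probs = {"viagra": 0.99, "free": 0.9, "meeting": 0.05, "slug": 0.01}

plain = spam_probability(["viagra", "free"], word_probs, top_n=2)
padded = spam_probability(
    ["viagra", "free", "lanthanide", "dispensary"], word_probs, top_n=2
)
```

With top_n=2, plain and padded come out the same: the untrained random words sit near the middle and never make the cut. Once a filter has been trained on such spams, "lanthanide" would get its own high entry in word_probs and start counting against the mail.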
