On Wed, 20 Jan 2016 08:52:05 -0800 Marc Perkel <[email protected]> wrote:
> Suppose I get an email with the subject line "Let's get some lunch".
> I know it's a good email because spammers never say "Let's go to
> lunch".

Really?  You know that for a fact?

> In fact there are an infinite number of words and phrases
> that are used in good email that are never ever used in spam.

Really?  You know that for a fact?  [I mean, it's demonstrably false
that the number of good words and phrases is "infinite", but I'll give
you the benefit of the doubt and assume you mean "very large"]

> A new message comes in. It is tokenized and fingerprinted and
> hundreds of fingerprints are generated. Then it's all set operations.
> the set of fingerprints of the test message is intersected with the
> spam and ham corpi creating sub sets of matches. Then you do a set
> diff both ways (ham - spam) (spam - ham) and whichever side is bigger
> wins. Generally it will match on only one side or very predominately
> on one side.

I see what you're doing... if there are more tokens that have been seen
in ham but NEVER in spam than the other way around, it's hammy,
otherwise spammy.

But I'm not convinced this will actually work.  In fact, it seems that
this algorithm is even more susceptible to poisoning than Bayes.
Because it only relies on a token *ever*, even once, appearing in a ham
or a spam, it's far more sensitive to poisoning... just *one*
appearance of a ham-token in a spam poisons that token for all time and
vice-versa.

> SpamAssassin is all about matching rules. This is all about not
> matching. Not matching allows you to compare to an infinite set
> rather than a finite set. So when spammers start misspelling words to
> not match the rules, my system catches that and makes its own rules.

Bayes does that too.  And probably with more theoretical justification.

> This new method (I'm calling it the Evolution Spam Filter because the
> algorithm mimics evolution.) it doesn't just block spammers, it
> decimates spammers. It's not just a treatment - it's the cure.

I think it's FUSSP.

I have no doubt that the algorithm works *for you* because you probably
only see a really tiny percentage of the email on the whole Internet.
And also, for small data sets, it probably gives results very similar
to Bayes which is itself quite effective, especially if you consider
multi-word tokens.

I doubt it will be or remain any more effective than Bayes if used
widely.  I'd be happy to be given hard data proving me wrong, though.

> The side effects is this is a very fast and simple recursive learner.
> What happens is that as people converse by email it learns more words
> and phrases about the stuff that people talk about that are never
> used in spam.

Bayes does that too.

> It doesn't have to know what language you are using, it
> will learn it on it's own.

Bayes does that too.

Regards,

Dianne.
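
For reference, here is a minimal Python sketch of the set-difference
scoring as described in the quoted text, along with the poisoning
concern raised above.  It is only a reading of that description, not
the original implementation: the word n-gram `fingerprints` function,
the `classify` helper, the "unsure" result for ties, and the sample
messages are all assumptions made for illustration.

```python
def fingerprints(text, max_n=3):
    """Hypothetical fingerprinting: lowercase word n-grams, n = 1..max_n."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def classify(message, ham_seen, spam_seen):
    """Compare fingerprints seen only in ham against those seen only in spam."""
    fp = fingerprints(message)
    ham_hits = fp & ham_seen        # fingerprints ever seen in ham
    spam_hits = fp & spam_seen      # fingerprints ever seen in spam
    ham_only = ham_hits - spam_hits   # (ham - spam)
    spam_only = spam_hits - ham_hits  # (spam - ham)
    if len(ham_only) > len(spam_only):
        return "ham"
    if len(spam_only) > len(ham_only):
        return "spam"
    return "unsure"  # assumed tie behaviour; the description does not say


# Toy corpora: one ham message and one spam message.
ham_seen = fingerprints("let's get some lunch")
spam_seen = fingerprints("cheap pills online")

test_message = "let's get some lunch tomorrow"
before = classify(test_message, ham_seen, spam_seen)

# Poisoning: a single spam that contains the ham phrase.
poisoned_spam_seen = spam_seen | fingerprints("let's get some lunch")
after = classify(test_message, ham_seen, poisoned_spam_seen)

print(before, after)
```

In this toy run the message scores as ham until a single spam
containing the same phrase is added to the spam set.  After that, every
matching fingerprint appears on both sides, both differences are empty,
and the message can no longer be classified.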
