I wonder how this differs from some of the classifiers within CRM114. Several of them seem to work on phrases (with high costs) or single words.

{^_^}

On 2016-01-20 11:05, Dianne Skoll wrote:
On Wed, 20 Jan 2016 08:52:05 -0800
Marc Perkel wrote:

Suppose I get an email with the subject line "Let's get some lunch".
I know it's a good email because spammers never say "Let's go to
lunch".

Really?  You know that for a fact?

In fact there are an infinite number of words and phrases
that are used in good email that are never ever used in spam.

Really?  You know that for a fact?  [I mean, it's demonstrably false
that the number of good words and phrases is "infinite", but I'll give
you the benefit of the doubt and assume you mean "very large".]

A new message comes in. It is tokenized and fingerprinted, and
hundreds of fingerprints are generated. Then it's all set operations.
The set of fingerprints of the test message is intersected with the
spam and ham corpora, creating subsets of matches. Then you do a set
diff both ways (ham - spam) (spam - ham) and whichever side is bigger
wins. Generally it will match on only one side or very predominantly
on one side.

I see what you're doing... if there are more tokens that have been
seen in ham but NEVER in spam than the other way around, it's hammy,
otherwise spammy.
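
A minimal sketch of the set-based scheme as described above, for
illustration only. It assumes tokens are lowercase word n-grams of one
to three words; the original post does not say how messages are
tokenized or fingerprinted. The names tokenize and classify and the
tiny corpora are hypothetical.

```python
import re

def tokenize(text, max_n=3):
    """Return the set of lowercase word n-grams (1..max_n words) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }

def classify(message, ham_tokens, spam_tokens):
    """Compare tokens seen only in ham against tokens seen only in spam."""
    tokens = tokenize(message)
    ham_hits = tokens & ham_tokens      # message tokens ever seen in ham
    spam_hits = tokens & spam_tokens    # message tokens ever seen in spam
    ham_only = ham_hits - spam_hits     # seen in ham, never in spam
    spam_only = spam_hits - ham_hits    # seen in spam, never in ham
    if len(ham_only) > len(spam_only):
        return "ham"
    if len(spam_only) > len(ham_only):
        return "spam"
    return "unsure"

# Hypothetical training data, far too small to be meaningful.
ham_corpus = ["Let's get some lunch on Friday", "Meeting notes attached"]
spam_corpus = ["Get cheap pills now", "Cheap watches, buy now"]

ham_tokens = set().union(*(tokenize(m) for m in ham_corpus))
spam_tokens = set().union(*(tokenize(m) for m in spam_corpus))

print(classify("Want to get some lunch?", ham_tokens, spam_tokens))
print(classify("Buy cheap pills", ham_tokens, spam_tokens))
```

Note that the word "get" appears in both corpora here, so it cancels
out and contributes nothing to either side.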

But I'm not convinced this will actually work.  In fact, it seems that
this algorithm is even more susceptible to poisoning than Bayes.
Because it only relies on a token *ever*, even once, appearing in a
ham or a spam, it's far more sensitive to poisoning... just *one*
appearance of a ham-token in a spam poisons that token for all time and
vice-versa.
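
To make the poisoning concern concrete, here is a small sketch under
the same reading of the algorithm. The token sets and the counts in
the comparison are made up; ham_only_evidence is a hypothetical helper,
and the count-based figure is a plain ratio, not a full Bayes
calculation.

```python
def ham_only_evidence(message_tokens, ham_seen, spam_seen):
    """Tokens of the message that were seen in ham but never in spam."""
    return (message_tokens & ham_seen) - spam_seen

# Hypothetical state of the learner.
ham_seen = {"lunch", "meeting", "invoice"}
spam_seen = {"pills", "casino"}
message = {"lunch", "meeting"}

before = ham_only_evidence(message, ham_seen, spam_seen)

# A single spam that happens to mention "lunch" is learned.
spam_seen.add("lunch")
after = ham_only_evidence(message, ham_seen, spam_seen)

print(sorted(before))  # both tokens count as ham evidence
print(sorted(after))   # "lunch" no longer counts at all

# With counts instead of set membership, one spam occurrence against
# 1000 ham occurrences (made-up numbers) barely moves the ratio.
ham_count, spam_count = 1000, 1
spam_fraction = spam_count / (ham_count + spam_count)
print(round(spam_fraction, 4))
```

In the set-based version one stray occurrence removes the token as
evidence entirely, while a count-based score only shifts slightly.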

SpamAssassin is all about matching rules. This is all about not
matching. Not matching allows you to compare to an infinite set
rather than a finite set. So when spammers start misspelling words to
not match the rules, my system catches that and makes its own rules.

Bayes does that too.  And probably with more theoretical justification.

This new method (I'm calling it the Evolution Spam Filter because the
algorithm mimics evolution) doesn't just block spammers, it
decimates spammers. It's not just a treatment - it's the cure.

I think it's FUSSP.  I have no doubt that the algorithm works *for
you* because you probably only see a really tiny percentage of the
email on the whole Internet.  And also, for small data sets, it
probably gives results very similar to Bayes which is itself quite
effective, especially if you consider multi-word tokens.

I doubt it will be or remain any more effective than Bayes if used
widely.  I'd be happy to be given hard data proving me wrong, though.

The side effect is that this is a very fast and simple recursive
learner. As people converse by email, it learns more words and phrases
about the stuff that people talk about that are never used in spam.

Bayes does that too.

It doesn't have to know what language you are using; it
will learn it on its own.

Bayes does that too.

Regards,

Dianne.
