On 08/22/16 07:45, Dianne Skoll wrote:
On Mon, 22 Aug 2016 07:34:00 -0700
Marc Perkel <supp...@junkemailfilter.com> wrote:

So.  What percentage of emails using your algorithm are actually
decidable?
Almost 100% if you look at a wide variety of tokens from multiple
attributes.
I can't believe that, or I'm missing something.  Almost every spam I see
contains words that also appear in ham.  Things like "this" or "invoice"
or "regards" or "dear".

What am I missing?



Hi Dianne, what you're missing is word combinations. Usually it's not a single word but a combination of words that triggers a result.


     Example of how NOT matching works

Let’s take 2 subject lines and see how this works.

“Meet hot Russian Brides Online!”
“I read an article about Russian Brides in a magazine”

A traditional spam filter using Bayesian scoring or hard-coded rules about “Russian Brides” might determine that only 1 out of 500 emails mentioning the phrase “Russian Brides” is a good email. Thus the second line would have points assessed against it in the classification process using these traditional methods.

With the Evolution Filter, the phrase “Russian Brides” is in both sets and therefore has no influence on the result. The first subject, however, matches these phrases, which appear only in the Spam Only set:

“Meet hot”
“Meet hot Russian”
“Meet hot Russian Brides”
“hot Russian Brides Online!”
“Russian Brides Online!”
“Brides Online!”
“Online!”

The second subject matches these phrases in the Ham Only set, which are never used in the spam set:

“I read an article”
“read an article”
“read an article about”
“about Russian”
“an article about”
“in a magazine”
“Brides in a”

So even though the phrase “Russian Brides” has no influence, each subject hits either ham or spam many times with phrases that were never used in a subject line in the opposite set. The number of hits is significant enough, just from these subjects, to cause the fingerprints to be learned, and that’s just looking at the Subject attribute. When this is combined with testing all attributes, the messages usually come out strongly on one side or the other.

In rule-based systems one would not normally build a whitelist rule to allocate points for seeing the phrase “read an article about”. That’s where the Evolution Filter is different. It doesn’t need that rule: because it compares against the infinite set of what is not matched on the other side, it dynamically creates billions of rules automatically.
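
To make the idea concrete, here is a minimal sketch in Python. It is only an illustration, not the Evolution Filter's actual code: it assumes phrases are runs of 1 to 4 whitespace-separated words, that the learned sets are plain Python sets, and the names `phrases` and `count_hits` are made up for this example.

```python
def phrases(subject, max_len=4):
    """Return every run of 1..max_len consecutive words in the subject.

    Assumption: words are split on whitespace and punctuation is kept,
    so "Online!" is a token of its own.
    """
    words = subject.split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def count_hits(subject, spam_seen, ham_seen):
    """Count phrases seen only in spam and phrases seen only in ham.

    Phrases present in both learned sets are ignored entirely.
    """
    tokens = phrases(subject)
    spam_hits = len(tokens & (spam_seen - ham_seen))
    ham_hits = len(tokens & (ham_seen - spam_seen))
    return spam_hits, ham_hits


# Pretend each set was learned from a single subject line.
spam_subject = "Meet hot Russian Brides Online!"
ham_subject = "I read an article about Russian Brides in a magazine"
spam_seen = phrases(spam_subject)
ham_seen = phrases(ham_subject)

print(count_hits(spam_subject, spam_seen, ham_seen))      # (11, 0)
print(count_hits(ham_subject, spam_seen, ham_seen))       # (0, 31)
print(count_hits("Russian Brides", spam_seen, ham_seen))  # (0, 0)
```

“Russian”, “Brides” and “Russian Brides” are in both sets, so they contribute nothing; every other phrase lands on exactly one side. No rule for “read an article about” was ever written by hand; it falls out of the set difference.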






--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
