On 08/22/16 07:45, Dianne Skoll wrote:
> On Mon, 22 Aug 2016 07:34:00 -0700
> Marc Perkel <supp...@junkemailfilter.com> wrote:
>>> So. What percentage of emails using your algorithm are actually
>>> decidable?
>> Almost 100% if you look at a wide variety of tokens from multiple
>> attributes.
> I can't believe that, or I'm missing something. Almost every spam I see
> contains words that also appear in ham. Things like "this" or "invoice"
> or "regards" or "dear".
> What am I missing?
Hi Dianne, what you're missing are word combinations. Usually it's not a
single word but a combination of words that triggers a result.
Example of how NOT matching works
Let’s take 2 subject lines and see how this works.
“Meet hot Russian Brides Online!”
“I read an article about Russian Brides in a magazine”
A traditional spam filter using Bayesian or hard coded rules about
“Russian Brides” might determine that only 1 out of 500 emails
mentioning the phrase “Russian Brides” is a good email. Thus the second
line would have points assessed against it in the classification process
using these traditional methods.
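To make the contrast concrete, here is a rough sketch of that
traditional per-phrase scoring. It only illustrates the arithmetic in
the 1-in-500 example; the function and variable names are made up for
illustration and do not come from any particular filter.

```python
# Hypothetical figures taken from the example above: 1 good email
# out of 500 that mention the phrase.
def phrase_spam_probability(ham_count, total_count):
    """Fraction of messages containing the phrase that were spam."""
    return (total_count - ham_count) / total_count

PHRASE = "russian brides"
p_spam = phrase_spam_probability(ham_count=1, total_count=500)

subjects = [
    "Meet hot Russian Brides Online!",
    "I read an article about Russian Brides in a magazine",
]

# Every subject containing the phrase receives the same evidence,
# regardless of the words around it.
evidence = {s: p_spam for s in subjects if PHRASE in s.lower()}
for s, p in evidence.items():
    print(f"{p:.3f}  {s}")
```

Both subjects pick up the same penalty from the phrase, which is why
the legitimate one gets points assessed against it.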
Using the Evolution Filter, the phrase “Russian Brides” is in both sets
and therefore has no influence on the results. But the first subject
matches these phrases in the Spam Only set:
“Meet hot”
“Meet hot Russian”
“Meet hot Russian Brides”
“hot Russian Brides Online!”
“Russian Brides Online!”
“Brides Online!”
“Online!”
The second subject matches these phrases in the Ham Only set, none of
which are ever used in the spam set:
“I read an article”
“read an article”
“read an article about”
“about Russian”
“an article about”
“in a magazine”
“Brides in a”
So even though the phrase “Russian Brides” has no influence, each
subject hits either ham or spam many times on phrases that were never
used in a subject line in the opposite set. The number of hits from
these subjects alone is significant enough to cause the fingerprints to
be learned, and that’s just looking at the Subject attribute. When this
is combined with testing all attributes, the messages usually come out
strongly on one side or the other.
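The idea above can be sketched in a few lines of Python. This is a toy
illustration and not the actual Evolution Filter code: the function
names, the four-word cap on phrase length, and the lowercasing are all
assumptions, and the "training data" is just the two example subjects.

```python
def phrases(text, max_words=4):
    """Every run of 1..max_words consecutive words in text (lowercased)."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    }

def build_sets(ham_subjects, spam_subjects):
    """Return (ham_only, spam_only): phrases seen on one side but never the other."""
    ham = set().union(*(phrases(s) for s in ham_subjects))
    spam = set().union(*(phrases(s) for s in spam_subjects))
    return ham - spam, spam - ham

def hits(subject, ham_only, spam_only):
    """Count how many of the subject's phrases fall in each one-sided set."""
    found = phrases(subject)
    return len(found & ham_only), len(found & spam_only)

spam_subject = "Meet hot Russian Brides Online!"
ham_subject = "I read an article about Russian Brides in a magazine"

ham_only, spam_only = build_sets([ham_subject], [spam_subject])

# "russian brides" appears on both sides, so it lands in neither set.
print("russian brides" in ham_only, "russian brides" in spam_only)
print(hits(spam_subject, ham_only, spam_only))  # (ham hits, spam hits)
print(hits(ham_subject, ham_only, spam_only))
```

Each subject scores only on its own side, and the shared phrase
contributes nothing either way. A real system would run the same
comparison over every attribute of the message, not just the Subject.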
In rule-based systems one would not normally build a whitelist rule to
allocate points based on seeing the phrase “read an article about”.
That’s where the Evolution Filter is different. It doesn’t need that
rule: because it compares against the infinite set of what is not
matched on the other side, it dynamically creates billions of rules
automatically.
--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400