On 20.01.2016 at 17:52, Marc Perkel wrote:
So - how do I get a list of words and phrases never used in spam? I
create a list of words and phrases that are used in spam and check to
see if it's *not on the list*.

What I do is tokenize the spammiest parts of the email, like the subject
line, into words and phrases of 1, 2, 3 and 4 words.

the quick brown fox jumps over the lazy dog - becomes

"the" "quick" "the quick" "brown" "quick brown" "the quick brown" "fox"
"brown fox" "quick brown fox" "the quick brown fox" "jumps" "fox jumps"
"brown fox jumps" "quick brown fox jumps" "over" "jumps over" "fox jumps
over" "brown fox jumps over" "the" "over the" "jumps over the" "fox
jumps over the" "lazy" "the lazy" "over the lazy" "jumps over the lazy"
"dog" "lazy dog" "the lazy dog" "over the lazy dog"
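
A minimal sketch of that tokenization, for illustration only. The function name `ngram_tokens` and the `max_len` parameter are hypothetical and not taken from the original implementation; lowercasing and whitespace splitting are assumptions too. It emits, for each word, every phrase of 1 to 4 words that ends at that word, which reproduces the order of the list above.

```python
def ngram_tokens(text, max_len=4):
    """Return all word n-grams of length 1..max_len, grouped by end position."""
    # Assumption: tokens are lowercased and split on whitespace.
    words = text.lower().split()
    tokens = []
    for end in range(len(words)):
        # Shortest phrase first: the word itself, then 2, 3 and 4 words.
        for n in range(1, max_len + 1):
            start = end - n + 1
            if start < 0:
                break  # not enough preceding words for a longer phrase
            tokens.append(" ".join(words[start:end + 1]))
    return tokens


if __name__ == "__main__":
    for token in ngram_tokens("the quick brown fox jumps over the lazy dog"):
        print(repr(token))
```

For the nine-word example sentence this yields 30 tokens, the same ones listed above.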

These tokens are learned as ham or spam and added to sets. I'm using
Redis to do this because it has extremely fast set operations. I don't
know of anything other than Redis that can do this. So think about Redis
as the way to implement this.
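
A rough sketch of the set logic described here, using plain Python sets as a stand-in for Redis sets so that it runs without a server. The class and method names are hypothetical, and how the resulting token counts would be turned into a score is not stated in the mail. With Redis, the corresponding set commands would be along the lines of SADD for learning and SISMEMBER or SDIFF for the lookups.

```python
class TokenSets:
    """Two token sets, one learned from ham and one from spam (illustrative only)."""

    def __init__(self):
        self.ham = set()
        self.spam = set()

    def learn(self, tokens, is_spam):
        # Add every token of a classified message to the matching set.
        target = self.spam if is_spam else self.ham
        target.update(tokens)

    def ham_only(self, tokens):
        # Tokens of this message seen in ham but never in spam.
        return (set(tokens) & self.ham) - self.spam

    def spam_only(self, tokens):
        # Tokens of this message seen in spam but never in ham.
        return (set(tokens) & self.spam) - self.ham


if __name__ == "__main__":
    sets = TokenSets()
    sets.learn(["meeting", "tomorrow", "meeting tomorrow"], is_spam=False)
    sets.learn(["cheap", "pills", "cheap pills", "tomorrow"], is_spam=True)
    message = ["cheap", "meeting", "tomorrow"]
    print("ham only: ", sorted(sets.ham_only(message)))
    print("spam only:", sorted(sets.spam_only(message)))
```

Tokens that occur in both sets ("tomorrow" in the example) fall into neither result, which matches the idea of only trusting words and phrases that were never seen on the other side.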

So what's the difference from Bayes?

And regarding "I'm seeing close to 100% accuracy. It is so accurate it's scary and I think my implementation is crude at best": my hand-trained and maintained Bayes database with adjusted scores is certainly what you need to beat (I receive every milter reject as a BCC, and there are no false positives).

0      58885    SPAM
0      21446    HAM
0    2465800    TOKEN

total 91M
-rw------- 1 sa-milt sa-milt 10M 2016-01-20 19:12 bayes_seen
-rw------- 1 sa-milt sa-milt 81M 2016-01-20 19:12 bayes_toks

BAYES_00        16128   69.62 %
BAYES_05          515    2.22 %
BAYES_20          636    2.74 %
BAYES_40          526    2.27 %
BAYES_50         1681    7.25 %
BAYES_60          283    1.22 %     7.69 % (OF TOTAL BLOCKED)
BAYES_80          308    1.32 %     8.37 % (OF TOTAL BLOCKED)
BAYES_95          204    0.88 %     5.54 % (OF TOTAL BLOCKED)
BAYES_99         2883   12.44 %    78.36 % (OF TOTAL BLOCKED)
BAYES_999        2600   11.22 %    70.67 % (OF TOTAL BLOCKED)

DELIVERED       32943   91.46 %
DNSWL           32577   90.44 %
SPF             24223   67.25 %
SPF/DKIM WL     11128   30.89 %
SHORTCIRCUIT    12812   35.57 %

BLOCKED          3679   10.21 %
SPAMMY           3678   10.21 %    99.97 % (OF TOTAL BLOCKED)
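
For readers checking the second percentage column: it appears to be each count divided by the BLOCKED total of 3679. A short sketch that recomputes it from the counts quoted above (the variable and function names are mine, and the denominator is inferred from the figures rather than stated in the report):

```python
BLOCKED = 3679  # "BLOCKED" count from the report above

# Counts copied from the report.
counts = {
    "BAYES_60": 283,
    "BAYES_80": 308,
    "BAYES_95": 204,
    "BAYES_99": 2883,
    "BAYES_999": 2600,
    "SPAMMY": 3678,
}


def share_of_blocked(count, blocked=BLOCKED):
    """Percentage of blocked messages, formatted with two decimals."""
    return f"{100 * count / blocked:.2f}"


if __name__ == "__main__":
    for rule, count in counts.items():
        print(f"{rule:<10} {count:>6} {share_of_blocked(count):>7} % (OF TOTAL BLOCKED)")
```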

spamhaus.org               35681
inps.de                     9966
sorbs.net                   8436
barracudacentral.org        6940
thelounge.net               2224
manitu.net                   320
junkemailfilter.com          290
psbl.org                     208
spamcop.net                   35
mailspike.net                 19
=================================
Total DNSBL rejections:     64119

