On 20.01.2016 at 17:52, Marc Perkel wrote:
So - how do I get a list of words and phrases never used in spam? I create a list of words and phrases that are used in spam and check to see if it's *not on the list*.

What I do is tokenize the spammiest parts of the email, like the subject line, into words and phrases of 1, 2, 3 and 4 words.

the quick brown fox jumps over the lazy dog - becomes

"the"
"quick" "the quick"
"brown" "quick brown" "the quick brown"
"fox" "brown fox" "quick brown fox" "the quick brown fox"
"jumps" "fox jumps" "brown fox jumps" "quick brown fox jumps"
"over" "jumps over" "fox jumps over" "brown fox jumps over"
"the" "over the" "jumps over the" "fox jumps over the"
"lazy" "the lazy" "over the lazy" "jumps over the lazy"
"dog" "lazy dog" "the lazy dog" "over the lazy dog"

These tokens are learned as ham or spam and added to sets. I'm using Redis to do this because it has extremely fast set operations. I don't know of anything other than Redis that can do this. So think about Redis as the way to implement this.
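The tokenization described above can be sketched in a few lines of Python. This is only an illustration and not Marc's actual code: the names `ngram_tokens`, `learn` and `never_in_spam` are made up for this example, lowercasing and whitespace splitting are assumptions, and plain Python sets stand in for the Redis sets (in Redis the equivalent operations would be along the lines of SADD and SISMEMBER).

```python
def ngram_tokens(text, max_len=4):
    """Return all phrases of 1..max_len consecutive words, grouped by
    the word they end on, in the same order as the example above."""
    words = text.lower().split()  # assumption: lowercase, split on whitespace
    tokens = []
    for end in range(1, len(words) + 1):
        for n in range(1, max_len + 1):
            start = end - n
            if start < 0:
                break  # not enough preceding words for a longer phrase
            tokens.append(" ".join(words[start:end]))
    return tokens


# Stand-ins for the Redis sets (hypothetical names).
spam_seen = set()
ham_seen = set()


def learn(text, seen):
    """Add every token of a message part to the given set."""
    seen.update(ngram_tokens(text))


def never_in_spam(text):
    """Tokens of this text that were never learned as spam."""
    return [t for t in ngram_tokens(text) if t not in spam_seen]


if __name__ == "__main__":
    learn("cheap pills now", spam_seen)
    print(never_in_spam("cheap meeting now"))
```

How the resulting "not on the list" tokens are turned into a score is not described in the quoted text, so the sketch stops at the set lookup.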
so what's the difference to bayes?

and as for "I'm seeing close to 100% accuracy. It is so accurate it's scary and I think my implementation is crude at best": my hand-trained and maintained bayes with adjusted scores is for sure what you need to beat (i get every milter-reject as BCC and there are no false positives)
0      58885  SPAM
0      21446  HAM
0    2465800  TOKEN

total 91M
-rw------- 1 sa-milt sa-milt 10M 2016-01-20 19:12 bayes_seen
-rw------- 1 sa-milt sa-milt 81M 2016-01-20 19:12 bayes_toks

BAYES_00        16128   69.62 %
BAYES_05          515    2.22 %
BAYES_20          636    2.74 %
BAYES_40          526    2.27 %
BAYES_50         1681    7.25 %
BAYES_60          283    1.22 %    7.69 % (OF TOTAL BLOCKED)
BAYES_80          308    1.32 %    8.37 % (OF TOTAL BLOCKED)
BAYES_95          204    0.88 %    5.54 % (OF TOTAL BLOCKED)
BAYES_99         2883   12.44 %   78.36 % (OF TOTAL BLOCKED)
BAYES_999        2600   11.22 %   70.67 % (OF TOTAL BLOCKED)

DELIVERED       32943   91.46 %
DNSWL           32577   90.44 %
SPF             24223   67.25 %
SPF/DKIM WL     11128   30.89 %
SHORTCIRCUIT    12812   35.57 %

BLOCKED          3679   10.21 %
SPAMMY           3678   10.21 %   99.97 % (OF TOTAL BLOCKED)

spamhaus.org            35681
inps.de                  9966
sorbs.net                8436
barracudacentral.org     6940
thelounge.net            2224
manitu.net                320
junkemailfilter.com       290
psbl.org                  208
spamcop.net                35
mailspike.net              19
=================================
Total DNSBL rejections: 64119
